Abstract

We prove a lower bound on the relative entropy between two finite-dimensional states in terms of their entropy difference and the dimension of the underlying space. The inequality is tight in the sense that equality can be attained for any prescribed value of the entropy difference, both for quantum and classical systems. We outline implications for thermodynamics and information theory, such as a necessary condition for a process to be close to thermodynamic reversibility, or an easily computable lower bound on the classical channel capacity. Furthermore, we derive a tight upper bound, uniform for all states of a given dimension, on the variance of the surprisal, whose thermodynamic meaning is that of heat capacity.






Contents 

1 Introduction
  1.1 Notation
2 Main results
  2.1 Relative entropy vs. entropy difference
  2.2 Dimension bounds on second moments
    2.2.1 Maximum variance of the surprisal
    2.2.2 Maximum heat capacity in finite dimensions
3 Applications
  3.1 Thermodynamics applications
    3.1.1 Approach to reversibility in equilibration processes
    3.1.2 Free energy vs. entropy density
  3.2 Information-theoretic applications
    3.2.1 Cost of wrong code, universal codes, and Shannon channel capacity
    3.2.2 Hypothesis testing and large deviations
    3.2.3 Mutual information
4 Proofs
  4.1 Proof of Theorem 1
  4.2 Proof of Theorem 2
  4.3 Auxiliary Lemmas
  4.4 Proof of Theorem 8
5 References



* david.reeb@tuni.de 
^ ni.wolf@tuni.de 



1 Introduction 

The relative entropy is a distance-like measure that appears in a multitude of areas, such as information theory, thermodynamics, statistics and learning theory, being of operational significance in various situations (see Section 3 for a few applications). Also known as the Kullback-Leibler divergence, it was first introduced for probability distributions [KL51], and later generalized to quantum states [Ume62]. Another ubiquitous quantity is the entropy of a probability distribution or quantum state [Sha48, vN32], which in thermodynamics had already played a central role because entropy differences characterize possible and impossible thermodynamic state transformations (see e.g. the Clausius inequality in Section 3.1.1).

In this work, we lower-bound the relative entropy $D(\sigma\|\rho)$ between two states $\sigma$, $\rho$ (probability distributions or quantum states) in terms of their entropy difference $\Delta = S(\sigma) - S(\rho)$. Qualitatively, it is clear that such non-trivial lower bounds exist in any finite dimension due to the compactness of the state space, since $\Delta \neq 0$ implies $\sigma \neq \rho$ and thus $D(\sigma\|\rho) > 0$ by Klein's inequality [OP93]. Our main inequality (Theorem 1) makes this quantitative and is furthermore tight, meaning that for each dimension $d$ it provides the best lower bound on $D(\sigma\|\rho)$ in terms of $\Delta$, both for classical and quantum systems.

We note that any lower bound that can be derived by combining the tight Pinsker inequality [Csi67, AE05] with the tight Fannes-Audenaert inequality [Fan73, Aud07] will not be tight and will be strictly weaker than the derived bounds, even in its functional dependence (Remark 6).

Also considering states of finite dimension $d$, in Section 2.2 we give a tight upper bound on the variance of the surprisal (or information gain), which is quadratic in $\log d$ (Section 2.2.1); of course, the expectation value of the surprisal is just the entropy and is bounded by $\log d$ [OP93]. One thermodynamic implication of this result is an upper bound on the heat capacity of finite-dimensional systems (Section 2.2.2).



The inequalities presented here arose out of, and are used in, an investigation of finite-size effects in Landauer's Principle [RW13], but we expect them to have applications elsewhere in thermodynamics and information theory; some are outlined in Section 3. Furthermore, the finite-size bounds here arise in one-partite systems, whereas the Landauer scenario - the topic of [RW13] - is bipartite, involving a system and a thermal reservoir [Lan61].

Physically, our bounds are especially interesting for quantum thermodynamics [GMM10, SBL+11] and generally for the thermodynamics of microscopic systems or devices. Furthermore, even a large heat bath may sometimes be reasonably treated as small, when the equilibration time with another system is so short that only a small part of the bath effectively interacts with the system. Our bounds can be applied to derive finite-size corrections to well-known physical laws and, for example, alter efficiency analyses of physical processes like Carnot's or Landauer's [AG12, RW13].

By treating the Shannon and von Neumann (relative) entropies, our results are relevant to the conventional situation of many independent copies of a system state ("thermodynamic limit"), averaging quantities over these copies ("ensemble averages"). Thermodynamics and information theory can instead also be examined in the "single-shot setting", necessitating extra parameters such as the success probability of a process (e.g. [Ren05, Abe11, EDR+12]). Our setup is thus different from the one-shot scenario: Whereas the latter concerns a finite (small) number of systems, our results have implications in the limit of infinitely many finite-dimensional system copies. The variance computed in Section 2.2.1, however, can quantify how many copies of a finite-dimensional system have to be averaged before the Shannon or von Neumann entropies become sensible measures (see also [TH12, Li12]).



1.1 Notation 

All states $\sigma$, $\rho$ will be on a space of finite dimension $d < \infty$. In the quantum framework, states are positive semi-definite $d \times d$-matrices of trace 1 ("density matrices" [NC00]); in the classical (probability theory) framework, they are probability distributions on $d$ atomic events [CT06] and can be identified with diagonal density matrices in the obvious way. We often require $d \geq 2$ to exclude the trivial one-dimensional case, where some statements become pathological.

The entropy of a state $\rho$ is defined as

$$S(\rho) := -\mathrm{tr}[\rho \log \rho] . \tag{1}$$

Throughout, we use the natural logarithm, denoted by $\log$, and employ the usual rules of calculus on the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{\pm\infty\}$, such as $0 \log 0 := 0$; only in Section 3.2.1 will we also use the $D$-ary logarithm $\log_D x := (\log x)/(\log D)$, with $D > 1$. A quantity of central interest will be the entropy difference $\Delta = \Delta(\sigma,\rho)$ of the states $\sigma$ and $\rho$:

$$\Delta(\sigma,\rho) := S(\sigma) - S(\rho) \in [-\log d, +\log d] . \tag{2}$$

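The other central quantity is the relative entropy between two states $\sigma$ and $\rho$:

$$D(\sigma\|\rho) := \mathrm{tr}[\sigma \log \sigma] - \mathrm{tr}[\sigma \log \rho] , \tag{3}$$

which equals $+\infty$ if $\mathrm{supp}[\sigma] \not\subseteq \mathrm{supp}[\rho]$, and is finite otherwise, non-negative, and vanishes iff $\sigma = \rho$.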

We also define binary versions of the entropy and relative entropy, i.e. for binary probability distributions $(x, 1-x)$ and $(y, 1-y)$ with $0 \leq x, y \leq 1$:

$$H(x) := S(\mathrm{diag}(x, 1-x)) = x \log\frac{1}{x} + (1-x) \log\frac{1}{1-x} , \tag{4}$$

$$D_2(x\|y) := D(\mathrm{diag}(x, 1-x)\|\mathrm{diag}(y, 1-y)) = x \log\frac{x}{y} + (1-x) \log\frac{1-x}{1-y} . \tag{5}$$

Note that the entropy difference $\Delta(\sigma,\rho)$ changes sign under exchange of $\sigma$ and $\rho$, whereas the relative entropy $D(\sigma\|\rho)$ does not generally have any symmetry under exchange. For example, $\Delta = -\log d$ forces $\rho$ to be the maximally mixed state $\mathbb{1}/d$ and $\sigma$ to be any pure state (any Hermitian projector of rank 1), resulting in $D(\sigma\|\rho) = \log d$; whereas $\Delta = +\log d$ interchanges these $\rho$ and $\sigma$ and gives $D(\sigma\|\rho) = \infty$. The latter case is special as for any other $\Delta \in [-\log d, \log d)$ there exist full-rank states $\sigma$ and $\rho$ with $\Delta(\sigma,\rho) = \Delta$, such that $D(\sigma\|\rho) < \infty$ is finite.

For a more detailed discussion of entropic quantities we refer to [OP93] and [Weh78] or, in the context of classical and quantum information theory, to [CT06] and [NC00].

The acronyms LHS and RHS mean "left-hand side" and "right-hand side", respectively. 
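
The definitions (1)-(3) can be illustrated numerically for classical states (probability vectors, i.e. diagonal density matrices). The following is a minimal sketch under that assumption; the function names `entropy`, `relative_entropy` and `entropy_difference` are ours and not part of the original text. It reproduces the extreme cases discussed above for a pure state and the maximally mixed state.

```python
import math

def entropy(p):
    """Entropy S(p) = -sum_i p_i log p_i, Eq. (1), natural log, 0 log 0 := 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def relative_entropy(q, p):
    """Relative entropy D(q||p), Eq. (3), for probability vectors.

    Returns +inf if the support of q is not contained in the support of p.
    """
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:
            if pi <= 0.0:
                return math.inf
            total += qi * math.log(qi / pi)
    return total

def entropy_difference(q, p):
    """Entropy difference Delta(q, p) = S(q) - S(p), Eq. (2)."""
    return entropy(q) - entropy(p)

d = 4
pure = [1.0] + [0.0] * (d - 1)   # a pure state
mixed = [1.0 / d] * d            # the maximally mixed state

# Delta = -log d together with D = log d
print(entropy_difference(pure, mixed), relative_entropy(pure, mixed))
# Delta = +log d together with D = +inf
print(entropy_difference(mixed, pure), relative_entropy(mixed, pure))
```

Restricting to probability vectors keeps the sketch free of matrix functions; for non-commuting quantum states one would need matrix logarithms instead.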

2 Main results 



In Section 2.1 we state the tight inequality between relative entropy and entropy difference (Theorem 1) and describe properties and simplifications of the bound that are useful for applications (Section 3). The tight upper bound on the variance of the surprisal (or heat capacity) is given in Section 2.2. The proofs follow in Section 4.



2.1 Relative entropy vs. entropy difference 

To state our main inequality and its simplifications, we define for $d \geq 2$ and $\Delta \in [-\log d, \log d]$:

$$M(\Delta,d) := \min_{0 \leq s,r \leq (d-1)/d} \left\{ D_2(s\|r) \,\middle|\, H(s) - H(r) + (s-r)\log(d-1) = \Delta \right\} , \tag{6}$$

$$N(d) := \max_{0 \leq r \leq 1/2} r(1-r) \log^2\left( \frac{1-r}{r}(d-1) \right) , \tag{7}$$

$$N_d := \frac{1}{4}\log^2(d-1) + 1 . \tag{8}$$

(The expression $\log^2(d-1)$ should always be read as $(\log(d-1))^2$.) All of these quantities can be efficiently computed numerically as they involve optimizations over at most two bounded real variables. See also Fig. 1 and Lemmas 13 and 14 (Section 4.3) for relations among (6)-(8).
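
To make the remark on numerical computability concrete, here is a minimal sketch of how (6)-(8) might be evaluated. It is an illustration under our own assumptions, not the authors' method: the names `N_simple`, `N_opt`, `f`, `f_inverse` and `M` are hypothetical, both optimizations use plain grid scans, and the constraint in (6) is resolved through the function $f(x) = H(x) + x\log(d-1)$, which is increasing on $[0,(d-1)/d]$ and therefore invertible by bisection. The grid values approximate $N(d)$ from below and $M(\Delta,d)$ from above.

```python
import math

def H(x):
    """Binary entropy, Eq. (4), with 0 log 0 := 0."""
    return -sum(t * math.log(t) for t in (x, 1.0 - x) if t > 0.0)

def D2(x, y):
    """Binary relative entropy, Eq. (5); +inf if the support condition fails."""
    total = 0.0
    for a, b in ((x, y), (1.0 - x, 1.0 - y)):
        if a > 0.0:
            if b <= 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total

def N_simple(d):
    """N_d from Eq. (8)."""
    return 0.25 * math.log(d - 1) ** 2 + 1.0

def N_opt(d, grid=20000):
    """N(d) from Eq. (7), approximated from below by a grid scan over (0, 1/2]."""
    best = 0.0
    for i in range(1, grid + 1):
        r = 0.5 * i / grid
        best = max(best, r * (1.0 - r) * math.log((1.0 - r) / r * (d - 1)) ** 2)
    return best

def f(x, d):
    """H(x) + x log(d-1); increasing in x on [0, (d-1)/d], from 0 to log d."""
    return H(x) + x * math.log(d - 1)

def f_inverse(u, d, iters=60):
    """Solve f(x) = u on [0, (d-1)/d] by bisection (endpoints returned exactly)."""
    if u <= 0.0:
        return 0.0
    if u >= math.log(d):
        return (d - 1) / d
    lo, hi = 0.0, (d - 1) / d
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid, d) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def M(delta, d, grid=400):
    """Approximate M(delta, d) from Eq. (6) by scanning u = f(r) along the constraint."""
    logd = math.log(d)
    u_min, u_max = max(0.0, -delta), min(logd, logd - delta)
    best = math.inf
    for i in range(grid + 1):
        u = u_min + (u_max - u_min) * i / grid
        r = f_inverse(u, d)            # f(r) = u
        s = f_inverse(u + delta, d)    # f(s) - f(r) = delta
        best = min(best, D2(s, r))
    return best

for d in (2, 10, 50):
    print(d, N_opt(d), N_simple(d), M(0.5 * math.log(d), d))
```

A finer grid or a proper one-dimensional minimizer could replace the scan in `M` if higher accuracy were needed.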



Theorem 1 (Tight lower bound on relative entropy by entropy difference). Let $\sigma$, $\rho$ be states of dimension $d$, with $2 \leq d < \infty$, and define $\Delta := S(\sigma) - S(\rho)$. Then:

$$D(\sigma\|\rho) \geq M(\Delta,d) , \tag{9}$$

with the function $M(\Delta,d)$ defined in Eq. (6).

Conversely, for any $\Delta \in [-\log d, \log d]$, there exist $\sigma$, $\rho$ attaining equality in (9). More precisely, for any pair $(s,r)$ attaining the minimum in (6), the commuting $d$-dimensional states

$$\sigma := \mathrm{diag}\left(1-s, \frac{s}{d-1}, \ldots, \frac{s}{d-1}\right) , \quad \rho := \mathrm{diag}\left(1-r, \frac{r}{d-1}, \ldots, \frac{r}{d-1}\right) \tag{10}$$

have entropy difference $S(\sigma) - S(\rho) = \Delta$ and achieve equality $D(\sigma\|\rho) = M(\Delta,d)$.

Theorem 2 (Properties of the tight bound). Let $2 \leq d < \infty$. Then the function $M(\Delta,d)$ in the tight lower bound (9) is non-negative, continuous and strictly convex in $\Delta \in [-\log d, \log d]$, and continuously differentiable in the interior of this interval. It takes values $M(0,d) = 0$, $M(-\log d, d) = \log d$, $M(\log d, d) = \infty$, and $M(\Delta,d) < \infty$ for $\Delta \in [-\log d, \log d)$.

For any $N \geq N(d)$, with $N(d)$ from Eq. (7), the following lower bounds hold for all $\Delta$:

$$M(\Delta,d) \geq N\left(e^{\Delta/N} - 1 - \frac{\Delta}{N}\right) \geq \frac{\Delta^2}{2N} + \frac{\Delta^3}{6N^2} . \tag{11}$$

Easily computable choices for $N$ are $N = N_d = \frac{1}{4}\log^2(d-1) + 1 \geq N(d)$, or $N = \log^2 d \geq N(d)$.

Remark 3 (Equality cases in Eq. (9)). Regarding the equality statement in Theorem 1, we remark that for any $\Delta \in [-\log d, \log d]$ the minimum in (6) actually exists, i.e. is attained for some pair $(s,r)$ (see Section 4.2), and equals $\infty \in \overline{\mathbb{R}}$ for $\Delta = \log d$. Note that for states of the form (10), it is $D(\sigma\|\rho) = D_2(s\|r)$, $S(\sigma) = H(s) + s\log(d-1)$ and similar for $S(\rho)$, which explains the connection between (9) and (6). In Remark 11 we elaborate on the states (10), which come all from the same exponential family.

For $\Delta \neq 0$, the pair $(\sigma,\rho)$ from (10) constitutes, up to simultaneous unitary equivalence, the unique $d$-dimensional states achieving equality $D(\sigma\|\rho) = M(\Delta,d)$ and $S(\sigma) - S(\rho) = \Delta$. This follows from the proof of Theorem 1 in Section 4.1, as the optimal states for $\Delta \neq 0$ are necessarily of the form (10) with $0 \leq s,r \leq (d-1)/d$, and since for $\Delta \neq 0$ the pair $(s,r)$ attaining the minimum in (6) is unique (which is shown in our proof of the convexity of $M(\Delta,d)$ in Section 4.2). For $\Delta = 0$, exactly the pairs with $\sigma = \rho$ attain equality in (9).

As inequality (9) is tight for commuting density matrices, it is tight for classical probability distributions (diagonal density matrices) as well.

Figure 1: Upper and left panels: The upper (red) curves show $M(\Delta,d)$ from Eq. (6) (tight lower bound of Theorem 1) for $d = 2, 10, 50$. The black and blue solid curves below are the lower bounds from Eq. (11) with the optimal $N = N(d)$, the dotted blue curve is the quadratic lower bound $\Delta^2/2N(d)$ (for $\Delta \geq 0$). At $\Delta = \pm\log d$, all these lower bounds approach 2 in the limit $d \to \infty$, whereas $M(-\log d, d) = \log d$ and $M(\log d, d) = \infty$. Lower right panel: The red dots show $N(d)$ for $2 \leq d \leq 100$ (Eq. (7)), which approaches its easily computable upper bound $N_d$ (Eq. (8) and Remark 9) as $d \to \infty$ and which is lower-bounded by $N_d - 1$ (blue curves).



Remark 4 (Goodness of the lower bounds in Eq. (11)). One can check that the function $M(\Delta,d)$ is smooth around $\Delta = 0$ (see Section 4.2) and that the RHS of (11) with $N = N(d)$ is its cubic Taylor expansion. This is thus the best cubic lower bound possible, and $\Delta^2/2N(d)$ is the best quadratic lower bound for $\Delta \geq 0$ (Fig. 1); it is however not a lower bound for small $\Delta < 0$.

The lower bounds in (11) are quite good (cf. Fig. 1) even for relatively large $|\Delta|$: For any constant $t \in (-1,1)$, the states (10) with $s = (1+t)/2$, $r = (1-t)/2$ give $\Delta = \Delta(\sigma,\rho) = t\log(d-1)$ and $M(\Delta,d) \leq D(\sigma\|\rho) = D_2(s\|r) = t\log\frac{1+t}{1-t}$, whereas the lowest bound in (11) gives $\approx 2t^2$ (the cubic term vanishes as $\sim 1/\log d$). Even for the large values $\Delta = \pm\frac{1}{2}\log d$, the lower bound is thus tight (for large $d$) up to at most 10%.

The quantity $N(d)$ from Theorem 2 appears in the upper bound in Theorem 8 as well, see Remark 9. For this quantity, see also the lower right panel in Fig. 1.



Remark 5 (Monotonicity of $M(\Delta,d)$ in $\Delta$). The tight lower bound $M(\Delta,d)$ is strictly monotonically decreasing for $\Delta \leq 0$ and strictly increasing for $\Delta \geq 0$ (cf. Fig. 1). This follows since the non-negative function $M(\Delta,d)$ vanishes at $\Delta = 0$ and is strictly convex by Theorem 2 (see also Fig. 1). As our convexity proof in Section 4.2 is quite involved, we give now a simpler proof of monotonicity. We actually prove

$$M(\lambda\Delta,d) < \lambda M(\Delta,d) \quad \text{for } \Delta \in [-\log d, \log d] \setminus \{0\},\ \lambda \in (0,1) . \tag{13}$$

Let first $\Delta \in (0, \log d)$, $\lambda \in (0,1)$, and let $\sigma$, $\rho$ be states with $D(\sigma\|\rho) = M(\Delta,d)$ and $S(\sigma) - S(\rho) = \Delta$. Define states $\sigma_\mu := \mu\sigma + (1-\mu)\rho$ for $\mu \in [0,1]$. As $S(\sigma_\mu)$ is continuous in $\mu$, there exists $\mu' \in (0,1)$ with $S(\sigma_{\mu'}) - S(\rho) = \lambda\Delta$, and by strict concavity of the entropy we have

$$\lambda\Delta > \mu' S(\sigma) + (1-\mu') S(\rho) - S(\rho) = \mu'\Delta , \tag{14}$$

i.e. $\mu' < \lambda$. Convexity of the relative entropy [OP93] finally gives

$$M(\lambda\Delta,d) \leq D(\sigma_{\mu'}\|\rho) \leq \mu' D(\sigma\|\rho) + (1-\mu') D(\rho\|\rho) < \lambda M(\Delta,d) . \tag{15}$$

(13) holds for $\Delta = \log d$ as well, since $M(\lambda\Delta,d) < \infty$ due to $\lambda\Delta < \log d$. The proof for $\Delta < 0$ is similar, now replacing $\rho$ by some state $\rho_{\mu'} = \mu'\rho + (1-\mu')\sigma$.

Remark 6 (Lower bounds from the Fannes-Audenaert and Pinsker inequalities). A weaker lower bound on the relative entropy $D(\sigma\|\rho)$ in terms of the entropy difference $\Delta = \Delta(\sigma,\rho)$, as in Theorem 1, can be obtained by combining the Fannes-Audenaert [Fan73, Aud07] and Pinsker [Csi67] inequalities: Writing $T := \|\sigma - \rho\|_1/2$ for the trace distance (or total variation or statistical distance [CT06]) between the states $\sigma$ and $\rho$, we have the bound [Fan73, Aud07]

$$|\Delta| = |S(\sigma) - S(\rho)| \leq T\log(d-1) + H(T) =: h_d(T) \leq T\left(1 + \log(d-1) + \log\frac{1}{T}\right) , \tag{16}$$

the first inequality being tight, and the sharpened Pinsker bound [Csi67, CT06, HOT81, AE05]

$$D(\sigma\|\rho) \geq s(T) \geq 2T^2 , \tag{17}$$

where $s : [0,1] \to [0,\infty]$ is a function [AE05] such that the first inequality is tight (for any dimension $d \geq 2$) and which is lower-bounded by its quadratic Taylor expansion, $s(x) \geq 2x^2$. If now $\Delta \in [-\log d, \log d]$ is given, we can invert the function $h_d|_{[0,(d-1)/d]} : [0,(d-1)/d] \to [0,\log d]$ from (16), or lower-bound the inversion of its RHS, to get a lower bound on $T$:

$$T \geq h_d^{-1}(|\Delta|) \geq \frac{e-1}{e} \cdot \frac{|\Delta|}{1 + \log(d-1) - \log|\Delta|} , \tag{18}$$

where the prefactor is $(e-1)/e \approx 0.63$. Plugging either of these into (17) yields a lower bound on $D(\sigma\|\rho)$. This approach, however, can never yield a quadratic lower bound $\sim \Delta^2$ near $\Delta = 0$, as (9) and (11)-(12) together do, since the tight lower bound $s(T)$ in (17) is quadratic near $T = 0$ and since $h_d$ from (16) does not satisfy $h_d^{-1}(|\Delta|) \geq c(d)|\Delta|$ for any positive $d$-dependent constant $c(d)$. Numerically, one actually sees that, for all $d \geq 2$ and $\Delta \neq 0$, the lower bound obtained by plugging the RHS of (18) into the RHS of (17) is worse than the RHS of (11) with $N = N(d)$ (and even worse than the quadratic lower bound $\Delta^2/2N(d)$ for $\Delta \geq 0$).

Furthermore, this approach can only ever yield lower bounds that are invariant under $\Delta \mapsto -\Delta$ since the Fannes-Audenaert and Pinsker inequalities are both symmetric in $\sigma$ and $\rho$. The tight lower bound $M(\Delta,d)$ however does actually not have this invariance (see Fig. 1).



Remark 7 (Dimension-independent bounds are trivial). The non-trivial lower bounds (i.e., that are strictly positive for $\Delta \neq 0$) on the relative entropy from Theorems 1 and 2 depend explicitly on the dimension $d < \infty$. This has to be so as any dimension-independent bound will necessarily be trivial: Setting $t := \Delta/\log(d-1)$ in the states of Remark 4, with any constant $\Delta \in (-\infty,+\infty)$ and for large enough dimension $d$, gives $\Delta(\sigma,\rho) = \Delta$ and $D(\sigma\|\rho) = O(2\Delta^2/\log^2(d-1)) \to 0$ as $d \to \infty$, so that 0 is the best possible dimension-independent lower bound for any fixed value of $\Delta$; this also holds for states over infinite-dimensional Hilbert spaces. In this case, however, the lower bound 0 is never attained for $\Delta \neq 0$ as $D(\sigma\|\rho) = 0$ would imply $\sigma = \rho$ [OP93, BR97], and thus $\Delta = 0$ (if the entropies $S(\sigma)$, $S(\rho)$ are at all defined).

We further remark that the optimal lower bound $M(\Delta,d)$ is a decreasing function of $d$, implying that the finite-size corrections in applications (see Section 3) will be smaller for larger systems. To see this, let $d' > d \geq 2$, $\Delta \in [-\log d, \log d]$, and let $s, r$ be optimal variables when computing $M(\Delta,d)$ in (6). Now define $s' := s$, and find $r'$ such that the entropy difference $\Delta(\sigma',\rho')$ between $d'$-dimensional states $\sigma'$, $\rho'$ as in (10) equals the given $\Delta$; if $\Delta \neq 0$, $r'$ will be closer to $s' = s$ than $r$ is to $s$, such that $M(\Delta,d') \leq D_2(s'\|r') \leq D_2(s\|r) = M(\Delta,d)$ with strict inequality for $\Delta \neq 0$.

Proof of Theorems 1 and 2. The main part of the proof consists in reducing the minimization of $D(\sigma\|\rho)$ over (quantum) states $\sigma$, $\rho$ with a fixed value of $\Delta(\sigma,\rho) = \Delta$ to the simpler minimization over two bounded real variables in (6). We give the full proofs in Sections 4.1-4.3. $\square$



2.2 Dimension bounds on second moments 



In Section 2.2.1 we derive a tight upper bound on the variance of the surprisal in terms of the dimension of the underlying space. Translating to thermodynamics in Section 2.2.2, this yields an upper bound on the second moment of the energy of thermal states or, equivalently, on the heat capacity of finite-dimensional systems.

The derived bounds have apparent connections to the relative entropy inequalities from Theorems 1 and 2. Namely, the optimal states are of the same form and the (optimal) bounds involve the same quantities (see Remarks 9 and 11). Also, all the bounds are dimension-dependent and become trivial for infinite-dimensional spaces (cf. Remark 7). Furthermore, the heat capacity bound of Corollary 10 is used in [RW13] in a bipartite situation to actually lower-bound a relative entropy term in an indirect way, as the direct bound by Theorem 1 would necessarily depend on an undesired entropic quantity, i.e. one of the "wrong" subsystem.

2.2.1 Maximum variance of the surprisal 

In a classical random experiment described by a probability distribution $\rho = \mathrm{diag}(p_1, p_2, \ldots, p_d)$, the information gain upon outcome $i$ is $(-\log p_i)$, which is the unique sensible information measure in the limit of many independent experiments [Sha48, CT06]. Equivalently, the surprise about obtaining $i$ may be quantified by the surprisal $(-\log p_i)$. The (Shannon) entropy (1) is the expectation value of the surprisal, $S(\rho) = \sum_i p_i(-\log p_i) = \langle -\log\rho \rangle_\rho$. In this section, we look at its second moment, i.e. the variance or fluctuation of the surprisal:

$$\mathrm{var}_\rho(-\log\rho) := \sum_i p_i(-\log p_i)^2 - \left(\sum_i p_i(-\log p_i)\right)^2 = \mathrm{tr}\left[\rho\,(-\log\rho - S(\rho))^2\right] . \tag{19}$$

In classical coding theory, when the source signals are i.i.d. according to $\rho$, optimal prefix codes assign a codeword length of roughly $\sim(-\log p_i)$ to symbol $i$ [CT06]. The expected codeword length is thus $\sim S(\rho)$ with fluctuation $\sim\sqrt{\mathrm{var}_\rho(-\log\rho)}$, which implies a certain fluctuation in the lengths of encoded messages. (This holds up to an overall factor logarithmic in the size of the code alphabet, see Section 3.2.1.) Similar second-order effects in hypothesis testing using only finitely many copies have recently been investigated in [TH12, Li12].

The above definitions in terms of a density matrix $\rho$ are sensible in the quantum framework as well, and have similar interpretations [Sch95, NC00, SW01]. Note that $S(\rho)$ and $\mathrm{var}_\rho(-\log\rho)$ both depend only on the eigenvalues $\{p_i\}$ of the density matrix $\rho$.

Our main theorem here places a tight upper bound on the variance of the surprisal, only in terms of the dimension $d$ of the system. A non-tight upper bound is implicit in [PPV10], where the term $\sum_i p_i\log^2 p_i$ in (19) has been bounded. For the expectation value of the surprisal, i.e. the entropy, a tight upper bound is of course well-known: $S(\rho) \leq \log d$.

Theorem 8 (Maximum variance of the surprisal). Let $\rho$ be a state on a $d$-dimensional system. Then, for $d \geq 2$,

$$\mathrm{var}_\rho(-\log\rho) \leq N(d) = \max_{0 \leq r \leq 1/2} r(1-r)\left(\log\left(\frac{1-r}{r}(d-1)\right)\right)^2 \tag{20}$$

$$< N_d = \frac{1}{4}\log^2(d-1) + 1 . \tag{21}$$

(Cf. definitions (7) and (8), and Lemma 14.) For $d = 1$, $\mathrm{var}_\rho(-\log\rho) = 0$.

For $d \geq 2$, let $r = r_d$ attain the maximum in (20). Then the $d$-dimensional state

$$\rho = \mathrm{diag}\left(1 - r_d, \frac{r_d}{d-1}, \ldots, \frac{r_d}{d-1}\right) \tag{22}$$

achieves equality $\mathrm{var}_\rho(-\log\rho) = N(d)$.

Theorem 8 is proved in Section 4.4 by the method of Lagrange multipliers. The proof shows that (22) is the unique $\rho$ achieving equality, up to unitary equivalence.
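
Theorem 8 can be probed numerically for classical states. The sketch below is an illustration under our own assumptions: the helper names are hypothetical, the maximizer $r_d$ is located by bisection on the stationarity condition $(1-2r)\log\left(\frac{1-r}{r}(d-1)\right) = 2$ (obtained by setting the derivative of the objective in (20) to zero), and the random test states are drawn from an arbitrary skewed distribution chosen only for illustration.

```python
import math
import random

def surprisal_variance(p):
    """var_p(-log p) from Eq. (19), with 0 log 0 := 0."""
    mean = sum(-x * math.log(x) for x in p if x > 0.0)
    second = sum(x * math.log(x) ** 2 for x in p if x > 0.0)
    return second - mean ** 2

def objective(r, d):
    """The function of r that is maximized in Eq. (20)."""
    return r * (1.0 - r) * math.log((1.0 - r) / r * (d - 1)) ** 2

def optimal_r(d, iters=100):
    """Bisection for (1 - 2r) log((1-r)/r (d-1)) = 2 on (0, 1/2).

    The left-hand side decreases from +inf (r -> 0) to 0 (r = 1/2).
    """
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (1.0 - 2.0 * mid) * math.log((1.0 - mid) / mid * (d - 1)) > 2.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimal_state(d):
    """The state of Eq. (22) as a probability vector."""
    r = optimal_r(d)
    return [1.0 - r] + [r / (d - 1)] * (d - 1)

def random_state(d):
    """A random probability vector (skewed on purpose; illustrative choice)."""
    w = [random.expovariate(1.0) ** 3 for _ in range(d)]
    total = sum(w)
    return [x / total for x in w]

random.seed(0)
for d in (2, 5, 20):
    bound = objective(optimal_r(d), d)                       # N(d)
    largest = max(surprisal_variance(random_state(d)) for _ in range(2000))
    print(d, largest, bound, surprisal_variance(optimal_state(d)))
```

In each printed row the largest sampled variance should stay below the bound, while the state (22) reaches it.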



Remark 9 (The quantities $N(d)$ and $N_d$). $N(d)$ from Eq. (7) is well approximated by the easily computable $N_d = \frac{1}{4}\log^2(d-1) + 1$ since, by Lemma 14 (see also Fig. 1, lower right panel),

$$N_d > N(d) > N_d - 1 . \tag{23}$$

One can even show $N(d) = N_d - O(1/\log d)$ for $d \to \infty$, the optimal $r$ in (7) being $r_d = 1/2 - 1/\log(d-1) + O(1/\log^2 d)$. Instead of the maximization (7), one may compute $N(d)$ numerically by finding the optimal $r = r_d \in [0,1/2]$ as the (unique) solution of $(1-2r)\log\left(\frac{1-r}{r}(d-1)\right) = 2$ and plugging it back.

Note that the quantity $N(d)$ from the optimal upper bound (20) appears in the quadratic Taylor term of the optimal lower bound $M(\Delta,d)$ in (11) as well (cf. Remark 4). This can be understood in a pedestrian way by minimizing $D(\rho+\epsilon\|\rho)$ at fixed $\rho$ and for small $\epsilon$ (with $[\rho,\epsilon] = 0$; see beginning of the proof of Theorem 1) under the constraint $S(\rho+\epsilon) - S(\rho) = \delta$ (small), which gives $\delta^2/2\mathrm{var}_\rho(-\log\rho) + O(\delta^3)$. Finally minimizing this over all $\rho$, the quadratic term of $M(\delta,d)$ is therefore $\delta^2/2N(d)$ by Theorem 8.



2.2.2 Maximum heat capacity in finite dimensions 

We now explain the thermodynamic significance of Theorem 8. Let $H$ be a Hamiltonian of a $d$-dimensional system, i.e. a Hermitian $d \times d$-matrix (diagonal for classical systems). Then, at any temperature $T \in (0,\infty)$, the corresponding thermal (or equilibrium) state is

$$\rho_T := \frac{e^{-H/T}}{\mathrm{tr}[e^{-H/T}]} , \tag{24}$$

with units chosen such that Boltzmann's constant $k_B = 1$. The (average) energy of the thermal state is $E(T) := \mathrm{tr}[H\rho_T]$, and the heat capacity $C(T)$ quantifies the rate of change of the system energy upon temperature variation:

$$C(T) := \frac{dE}{dT} = \frac{d}{dT}\,\mathrm{tr}\left[H\,\frac{e^{-H/T}}{\mathrm{tr}[e^{-H/T}]}\right] = \mathrm{var}_{\rho_T}(H/T) = \mathrm{var}_{\rho_T}(-\log\rho_T) , \tag{25}$$

where we omitted the little computation of the derivative, and used in the last step that the variance is unchanged under addition of a constant term (proportional to $\mathbb{1}$).

Eq. (25) shows that the heat capacity does not depend on $H$ and $T$ separately, but only on the thermal state $\rho_T$. Note that every full-rank state $\rho$ can be interpreted as the thermal state of some Hamiltonian, and common extensions of the above framework include even some (or all) non-full-rank states; it is for example conventional to allow $T \in [0,\infty]$ and define $\rho_0$ to be the normalized projector onto the ground space of $H$, $H/\infty := 0$, and $C(\infty) := \lim_{T\to\infty} C(T)$. Further note that, by (25), the heat capacity also equals the energy fluctuations $\mathrm{var}_{\rho_T}(H)$, i.e. the second moment of the energy, up to a factor of $T^2$.

Theorem 8 has thus the following corollary:

Corollary 10 (Maximum heat capacity in $d$ dimensions). Let $H$ be any Hamiltonian on a $d$-dimensional system, and let $T \in [0,\infty]$. Then its heat capacity $C(T)$ is uniformly bounded in terms of the dimension: For $d \geq 2$,

$$C(T) \leq N(d) < N_d = \frac{1}{4}\log^2(d-1) + 1 , \tag{26}$$

with $N(d)$ from Eq. (7). For $d = 1$, $C(T) = 0$.

Note that the first bound in (26) is tight for any $d$: The optimal state $\rho$ from (22) has full-rank and is thus the thermal state of the Hamiltonian $H := -\log\rho$ at temperature $T := 1$.
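
The chain of equalities in (25) and the bound (26) can be checked numerically for a classical (diagonal) Hamiltonian. The following sketch rests on our own assumptions: the spectrum `energies` and the temperature are hypothetical example values, and the derivative $dE/dT$ is approximated by a central finite difference rather than computed analytically.

```python
import math

def thermal_state(energies, T):
    """Thermal state of Eq. (24) for a diagonal Hamiltonian, as a probability vector."""
    e0 = min(energies)  # shifting the spectrum does not change the state
    weights = [math.exp(-(e - e0) / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean_energy(energies, T):
    """E(T) = tr[H rho_T]."""
    return sum(p * e for p, e in zip(thermal_state(energies, T), energies))

def heat_capacity(energies, T, h=1e-5):
    """C(T) = dE/dT, approximated by a central finite difference."""
    return (mean_energy(energies, T + h) - mean_energy(energies, T - h)) / (2.0 * h)

def surprisal_variance(p):
    """var_p(-log p), with 0 log 0 := 0."""
    mean = sum(-x * math.log(x) for x in p if x > 0.0)
    second = sum(x * math.log(x) ** 2 for x in p if x > 0.0)
    return second - mean ** 2

energies = [0.0, 0.3, 1.0, 2.5]   # hypothetical spectrum, d = 4
T = 0.7                           # hypothetical temperature, k_B = 1

C = heat_capacity(energies, T)
V = surprisal_variance(thermal_state(energies, T))
N_d = 0.25 * math.log(len(energies) - 1) ** 2 + 1.0

# C and V should agree up to the finite-difference error, and both lie below N_d
print(C, V, N_d)
```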



Remark 11 (Exponential family of optimal states (10) and (22)). The optimal states $\rho$ and $\sigma$ from (10) come, for all values of $\Delta$, from the same exponential family: defining a $d$-dimensional "Hamiltonian" $H_{opt} := \mathrm{diag}(-1, 0, \ldots, 0)$, it is $\sigma, \rho = e^{-H_{opt}/T_{\sigma,\rho}}/\mathrm{tr}[e^{-H_{opt}/T_{\sigma,\rho}}]$ for some "temperatures" $T_{\sigma,\rho} \in [0,\infty]$. The same is true for the state (22) having maximal surprisal variance or heat capacity; thermal states with one large occupation number (eigenvalue) $\sim 1/2$ and completely degenerate small occupations have thus the largest energy fluctuations [Mac03].

On an $N$-particle system, e.g. the space $\mathbb{C}^d = (\mathbb{C}^l)^{\otimes N}$ of $N$ $l$-level particles, the Hamiltonian $H_{opt}$ means physically that the system energy is minimized ($-1$) when each of the $N$ particles is in a preferred state $|0\rangle$ and equals 0 otherwise, irrespective of the specific state. This very strong interaction between all $N$ particles leads, at some temperature $T_{crit}$, to the largest possible heat capacity of any $d = l^N$-dimensional system by Corollary 10:

$$C(T_{crit}) = N(l^N) \approx N_{d=l^N} \approx \frac{1}{4}\log^2 l^N = \frac{N^2}{4}\log^2 l . \tag{27}$$

This is in stark contrast to a system of $N$ independent (non-interacting) particles, whose heat capacity is proportional to $N$, i.e. "extensive", whereas (27) is faster than extensive.

When a system's heat capacity $C(T)/N$ per particle diverges at some temperature $T = T_{crit}$, one sometimes speaks of a second-order phase transition, and the system can then absorb or release energy density by just "reorganizing" its state without temperature change [Mac03]. Corollary 10 shows explicitly that such effects cannot occur for finite(-dimensional) systems.



3 Applications 

Here we outline some implications for thermodynamics and information theory of Theorem 1, the inequality relating relative entropy and entropy difference (see Section 2.1).

3.1 Thermodynamics applications

In Section 3.1.1 we examine how slowly equilibration processes [AG12] have to be conducted to make them (close to) thermodynamically reversible. A relation between an intensive and an extensive quantity in many-particle systems is given in Section 3.1.2. Regarding the extensivity of the heat capacity in many-body systems, see also the previous Remark 11.

The following sections also serve to illustrate the prominence of relative entropy and entropy difference in thermodynamics and statistical physics.

3.1.1 Approach to reversibility in equilibration processes 

In thermodynamics it is a common assumption (which can be justified in specific models) that a system with a Hamiltonian $H$ and in weak interaction with an environment at temperature $T$ will "equilibrate" to the thermal final state $\rho_f = e^{-H/T}/\mathrm{tr}[e^{-H/T}]$ (see Section 2.2.2), irrespective of its initial state $\rho_i$. The system's energy change associated with such a spontaneous state change is called heat flow or heat $\Delta Q$ [PW78, 11^12]:

$$\Delta Q := \mathrm{tr}[(\rho_f - \rho_i)H] = T\,\mathrm{tr}[(\rho_f - \rho_i)(-\log\rho_f)] . \tag{28}$$

One can relate this to the system's entropy change $\Delta S := \Delta(\rho_f,\rho_i) = S(\rho_f) - S(\rho_i)$:

$$\frac{\Delta Q}{T} = \Delta S - D(\rho_i\|\rho_f) \leq \Delta S . \tag{29}$$

(In order for all quantities to be well-defined, we assume $\rho_f$ to be a full-rank state, i.e. assume $T \in (0,\infty]$; for simplicity and without further mentioning, we assume all states in this section to be of full-rank or at least of the same support.)
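
The identity in (29) follows from $-\log\rho_f = H/T + \log\mathrm{tr}[e^{-H/T}]$ and can be confirmed numerically. The sketch below does so for a classical system; the spectrum, temperature and initial state are hypothetical example values, and the helper names are ours.

```python
import math

def entropy(p):
    """S(p) = -sum_i p_i log p_i, with 0 log 0 := 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def relative_entropy(q, p):
    """D(q||p) for probability vectors; +inf if supp[q] is not inside supp[p]."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:
            if pi <= 0.0:
                return math.inf
            total += qi * math.log(qi / pi)
    return total

def thermal_state(energies, T):
    """Thermal state e^{-H/T} / tr[e^{-H/T}] for a diagonal Hamiltonian."""
    weights = [math.exp(-e / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

energies = [0.0, 0.5, 1.2]        # hypothetical spectrum of H
T = 0.8                           # hypothetical environment temperature
rho_i = [0.6, 0.3, 0.1]           # hypothetical full-rank initial state
rho_f = thermal_state(energies, T)

heat = sum((pf - pi) * e for pf, pi, e in zip(rho_f, rho_i, energies))  # Eq. (28)
dS = entropy(rho_f) - entropy(rho_i)
gap = relative_entropy(rho_i, rho_f)

# Eq. (29): heat / T equals dS - gap, and is therefore at most dS
print(heat / T, dS - gap, dS)
```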

The above equilibration processes can also be conducted in a stepwise fashion, which was presented and analyzed in detail by Anders and Giovannetti [AG12]. One can view this as an attempt to formalize the vague notion of "slowness" of an equilibration process, which according to common physics folklore should make the process "thermodynamically reversible". We now recapitulate some elements from [AG12] and complement their analysis by a lower bound on how "close" a process can be to reversibility.

In a $k$-step process, adjust the system Hamiltonian successively first to $H_1$, then instantaneously to $H_2$, ..., and finally to $H_k = H$, and let the system equilibrate with an environment at temperature $T_j$ in each step $j = 1, \ldots, k$ (often, it will be either $H_j = H$ for all $j$, or $T_j = T$ for all $j$). We denote the associated intermediate thermal states by $\rho_j := e^{-H_j/T_j}/\mathrm{tr}[e^{-H_j/T_j}]$ (note, $\rho_k = \rho_f$) and define $\rho_0 := \rho_i$. The entropy change $\Delta S$ of the overall process equals just the sum of all changes $\Delta(\rho_j,\rho_{j-1})$, and the sum of the single-step quantities $\Delta Q_j/T_j$ satisfies, by (29),

$$\sum_{j=1}^{k} \frac{\Delta Q_j}{T_j} = \sum_{j=1}^{k} \mathrm{tr}[(\rho_j - \rho_{j-1})(-\log\rho_j)] = \Delta S - \sum_{j=1}^{k} D(\rho_{j-1}\|\rho_j) \leq \Delta S . \tag{30}$$

The inequality between the process quantity on the LHS and $\Delta S$ is the Clausius Theorem [AG12], often cited to be an incarnation of the Second Law of Thermodynamics. Note that, for $T_j = T$, the LHS is just proportional to the total heat flow into the system, $\sum_j \Delta Q_j$.
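In the special case [AG12] where intermediate steps $j = 1, \ldots, k-1$ are chosen such that the states $\rho_j$ interpolate linearly between $\rho_i = \rho_0$ and $\rho_f = \rho_k$, i.e.

$$\rho_j = \rho_i + \frac{j}{k}(\rho_f - \rho_i) \quad \text{for } j = 0, \ldots, k , \tag{31}$$

then the LHS of (30) can also be lower-bounded in terms of the entropy difference [AG12]:

$$\sum_{j=1}^{k} \frac{\Delta Q_j}{T_j} = \Delta S - \frac{D(\rho_f\|\rho_i) + D(\rho_i\|\rho_f)}{k} + \sum_{j=1}^{k} D(\rho_j\|\rho_{j-1}) \tag{32}$$

$$\geq \Delta S - \frac{D(\rho_f\|\rho_i) + D(\rho_i\|\rho_f)}{k} . \tag{33}$$

Thus, as the interpolation (31) becomes finer, i.e. as the number of steps $k$ grows (and if $\rho_i$, $\rho_f$ have the same support), one has $\sum_j \Delta Q_j/T_j \to \Delta S$. This is remarkable since a priori the quantity $\sum_j \Delta Q_j/T_j$ depends on the details of the process, whereas $\Delta S = \Delta(\rho_f,\rho_i)$ depends only on its initial and final state.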

Any process $\rho_i \mapsto \rho_1 \mapsto \ldots \mapsto \rho_f$ satisfying equality $\sum_j \Delta Q_j/T_j = \Delta S$ is called (thermodynamically) reversible, as intuitively one expects that the reverse of such a process leads back to the original situation. This intuition can be made rigorous for the process (31): The entropy production $\Delta S' = -\Delta S$ of the reverse process $\rho_f \mapsto \rho_{k-1} \mapsto \ldots \mapsto \rho_i$ exactly cancels $\Delta S$, and also the process quantity $\sum_j \Delta Q'_j/T'_j$ will come close to $\Delta S' \approx -\sum_j \Delta Q_j/T_j$ by reasoning analogous to (30) and (33). For constant temperatures $T_j = T$, the last fact means that (almost) no heat is produced during the entire cyclic process $\rho_i \mapsto \ldots \mapsto \rho_f \mapsto \ldots \mapsto \rho_i$, i.e. (almost) none of the work expended to (gradually) alter the Hamiltonian [PW78] is converted to heat, which physically is a less useful form of energy than work. In actual physical realizations, thermodynamic processes become irreversible when the system state $\rho(t)$ is not at all times $t$ close to the thermal state determined by the system Hamiltonian $H(t)$ and the environment temperature $T(t)$. This happens for example when the process is conducted too fast so that the system cannot fully equilibrate at each infinitesimal step.

From this reasoning, one can quantify the degree of irreversibility of the process $\rho_i \mapsto \rho_1 \mapsto
\ldots \mapsto \rho_f$ by the quantity $\sum_{j=1}^{k} D(\rho_{j-1}\|\rho_j)$ in (30). This corresponds to the amount of work
wasted at least as heat in any cyclic completion $\rho_i \mapsto \rho_1 \mapsto \ldots \mapsto \rho_f = \rho_k \mapsto \rho_{k+1} \mapsto \ldots \mapsto
\rho_{k+m} = \rho_i$, since $\sum_{j=1}^{k+m} \Delta Q_j/T_j = -\sum_{j=1}^{k+m} D(\rho_{j-1}\|\rho_j)$ by (30). Quantitatively, denoting the
minimal temperature $T_{\min} := \min_{1\le j\le k} T_j$, the excess heat production is at least

$$W_{\mathrm{waste}} \ge T_{\min} \sum_{j=1}^{k} D(\rho_{j-1}\|\rho_j) , \qquad (34)$$

which is exact if $T_j = T$ for all $j$. Theorem 1 now lower-bounds the sum in (34):

$$\sum_{j=1}^{k} D(\rho_{j-1}\|\rho_j) \ge \sum_{j=1}^{k} M(\Delta(\rho_{j-1},\rho_j),d) = k \sum_{j=1}^{k} \frac{1}{k} M(\Delta(\rho_{j-1},\rho_j),d) \qquad (35)$$

$$\ge k\, M\Big(\frac{1}{k}\sum_{j=1}^{k} \Delta(\rho_{j-1},\rho_j), d\Big) = k\, M\Big(-\frac{\Delta S}{k}, d\Big) \qquad (36)$$

$$\ge \frac{1}{k}\, \frac{(\Delta S)^2}{3\log^2 d} , \qquad (37)$$

where $d < \infty$ denotes the dimension of the system, the second inequality is by convexity of the
function $M$ (Theorem 2), and we exemplarily used the lower bound (12).
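
The chain (35)-(37) can be explored numerically. The sketch below assumes that $M(\Delta,d)$ is the minimum of the binary relative entropy $D_2(s\|r)$ over $s, r \in [0,(d-1)/d]$ subject to $H(s) - H(r) + (s-r)\log(d-1) = \Delta$, as used in the proof of Theorem 1. The grid-plus-bisection scheme, the function names and the example values are illustrative assumptions, and the result is only an upper estimate of the true minimum, intended for $|\Delta| < \log d$.

```python
import math


def h(x):
    """Binary entropy H(x) in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def d2(s, r):
    """Binary relative entropy D2(s||r) in nats, for 0 <= s < 1 and 0 < r < 1."""
    out = 0.0
    if s > 0:
        out += s * math.log(s / r)
    if s < 1:
        out += (1 - s) * math.log((1 - s) / (1 - r))
    return out


def g(x, d):
    """Entropy H(x) + x log(d-1) of diag(1-x, x/(d-1), ..., x/(d-1))."""
    return h(x) + x * math.log(d - 1)


def solve_s(target, d):
    """Bisection for s in [0, (d-1)/d] with g(s, d) = target (g is increasing there)."""
    lo, hi = 0.0, (d - 1) / d
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if g(mid, d) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def M(delta, d, grid=2000):
    """Grid estimate (from above) of min D2(s||r) subject to g(s) - g(r) = delta."""
    best = math.inf
    s_max = (d - 1) / d
    for i in range(1, grid + 1):
        r = s_max * i / grid
        target = g(r, d) + delta
        if 0.0 <= target <= math.log(d):
            best = min(best, d2(solve_s(target, d), r))
    return best


# Hypothetical example: dimension d = 4 and total entropy change dS = 0.5 nats.
d, dS = 4, 0.5
for k in (1, 2, 5, 10):
    print(k, k * M(-dS / k, d), dS ** 2 / (3 * k * math.log(d) ** 2))
```

Both printed columns decrease with the number of steps `k`; the first is the estimate of the bound (36), the second the simpler bound (37).
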
Achieving a degree $\epsilon$ of reversibility by a stepwise process thus necessitates a minimum
number $k = O(1/\epsilon)$ of steps via Eq. (37). When $k$ is interpreted as the time duration of the
entire process - assuming that each equilibration step consumes roughly equal time - then (37)
substantiates the folklore whereby thermodynamically reversible processes have to be conducted
"infinitely slowly". Our estimate is thus relevant for fundamental thermodynamics and especially
for small systems [SBL+11], as it delineates where the idealized but commonplace notion of
reversible process can apply. It also provides new heat bounds for processes out of equilibrium
in the area of non-equilibrium thermodynamics [Lin83, Jar99, Jar11].



Although the lower bound (36) is essentially tight in the typical thermodynamics situation
where only the entropy difference $\Delta S$ between two states is known, it becomes trivial for $\Delta S = 0$.
In this and other cases, when in addition the initial and final states $\rho_i$, $\rho_f$ are known, one may
use an estimate similar to (35)-(37) but based on Pinsker's inequality (17):

$$\sum_{j=1}^{k} D(\rho_{j-1}\|\rho_j) \ge \sum_{j=1}^{k} \frac{\|\rho_{j-1}-\rho_j\|_1^2}{2} \ge \frac{\|\rho_i-\rho_f\|_1^2}{2k} . \qquad (38)$$



On the topic of stepwise processes we finally remark that the approach to reversibility
$\sum_{j=1}^{k} \Delta Q_j/T_j \to \Delta S$ for $k \to \infty$ is not special to the linear interpolation process (31) [AG12].
Rather, for any (piecewise continuously differentiable) curve $\rho(t)$ in state space with $\rho(0) = \rho_i$,
$\rho(1) = \rho_f$, a discretization at points $0 = t_0 < t_1 < \ldots < t_k = 1$ gives

$$\sum_{j=1}^{k} \frac{\Delta Q_j}{T_j} = \sum_{j=1}^{k} \mathrm{tr}\big[(-\log\rho(t_j))\,(\rho(t_j)-\rho(t_{j-1}))\big] \qquad (39)$$

$$\to \int_{t=0}^{1} \mathrm{tr}\big[(-\log\rho(t))\, d\rho(t)\big] = \int_0^1 dt\, \mathrm{tr}\big[-\dot\rho(t)\log\rho(t)\big] \qquad (40)$$

$$= \int_0^1 dt\, \frac{d}{dt}\, \mathrm{tr}\big[-\rho(t)\log\rho(t)\big] = S(\rho_f) - S(\rho_i) = \Delta S , \qquad (41)$$

with convergence as the discretization becomes finer, $k \to \infty$ and $\max_j |t_j - t_{j-1}| \to 0$ (i.e. a
Riemann sum). Thus, any state change $\rho_i \mapsto \rho_f$ can be made thermodynamically reversible
(when $\mathrm{supp}[\rho_i] = \mathrm{supp}[\rho_f]$). For the discretized process $\rho(t)$ we do however not have a lower
convergence estimate as in (33) (the upper bound from the Clausius Theorem (30) holds of
course for any discretization).
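
The Riemann-sum convergence in (39)-(41) can be illustrated for a non-linear curve of classical distributions. This is a minimal sketch: the curve `rho`, its endpoints and the zero-sum "bump" direction are hypothetical choices, and a uniform discretization $t_j = j/k$ is assumed.

```python
import math


def entropy(p):
    """Shannon entropy S(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


# Hypothetical endpoints and a zero-sum bump direction; all entries stay positive.
P_I = [0.7, 0.2, 0.1]
P_F = [0.2, 0.3, 0.5]
BUMP = [0.1, -0.2, 0.1]


def rho(t):
    """A smooth, non-linear curve of distributions with rho(0) = P_I, rho(1) = P_F."""
    return [(1 - t) * a + t * b + t * (1 - t) * v for a, b, v in zip(P_I, P_F, BUMP)]


def clausius_sum(k):
    """sum_j tr[(-log rho(t_j)) (rho(t_j) - rho(t_{j-1}))] for t_j = j/k."""
    total = 0.0
    prev = rho(0.0)
    for j in range(1, k + 1):
        cur = rho(j / k)
        total += sum(-math.log(c) * (c - p) for c, p in zip(cur, prev))
        prev = cur
    return total


DELTA_S = entropy(P_F) - entropy(P_I)

for k in (10, 100, 1000):
    # The sum stays below DELTA_S (Clausius) and approaches it as k grows.
    print(k, clausius_sum(k), DELTA_S)
```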

In this section, we have considered thermalizing processes, bringing an arbitrary state $\rho_i$ to a
thermal state $\rho_f$, and have measured the heat production w.r.t. the Hamiltonian $H$ corresponding
to the final (thermal) state [AG12]. This leads to the Clausius inequality (30).

In [RW13] we use Theorem 1 in the reverse situation where an initially thermal state $\rho_i$ is
used as the resource in a process leading away from equilibrium. The heat production is again
measured w.r.t. the system's Hamiltonian, which there however is related to the initial state and
reverses the inequality (30) [AG12, RW13]. Furthermore, the paper [RW13] concerns a bipartite
scenario - the Landauer process involving a system and a thermal reservoir [Lan61] - where
a Second Law-like statement can be formulated more properly and where the above stepwise
process may be implemented by swapping the system and reservoir states.



3.1.2 Free energy vs. entropy density 

To further elucidate the thermodynamic meaning of the quantity $D(\rho_i\|\rho_f)$ for a thermal final
state $\rho_f = e^{-H/T}/\mathrm{tr}[e^{-H/T}]$ (cf. Eq. (29) in Section 3.1.1), we relate it to the work extractable
at constant temperature from the state $\rho_i$, and then examine it in a many-particle system.

For this, consider an isothermal process, i.e. where the temperature $T$ remains constant and
only the Hamiltonian is changed from its initial value $H_0 = H$ in $k$ successive steps to $H_1, \ldots,
H_k = H$, at each of which the system equilibrates as in Section 3.1.1. The total heat flow
$\Delta Q := \sum_{j=1}^{k} \Delta Q_j$ during the process then satisfies the Clausius inequality $T\Delta S \ge \Delta Q$ (Eq.
(30)), so that

$$T D(\rho_i\|\rho_f) = -\mathrm{tr}[H(\rho_f-\rho_i)] + T[S(\rho_f)-S(\rho_i)] = F(\rho_i) - F(\rho_f) \qquad (42)$$

$$\ge -\Delta E + \Delta Q = -\Delta W , \qquad (43)$$

where we have defined: the free energy $F(\rho) := \mathrm{tr}[H\rho] - TS(\rho)$ of a state $\rho$ (at temperature
$T$ and for Hamiltonian $H$), the internal energy increase $\Delta E := \mathrm{tr}[H(\rho_f - \rho_i)]$, and the work
$\Delta W := \Delta E - \Delta Q$ done on the system [PW78, ESU].



According to Section 3.1.1, equality in (43) can be approached by a suitable (reversible)
process (note that then the jump from $H = H_0$ to the first equilibration step $H_1 \approx -T\log\rho_i$
may be big, whereas the further steps $H_1 \mapsto \ldots \mapsto H_k = H$ are small). Thus, the amount
$T D(\rho_i\|\rho_f) = (-\Delta W)_{\max}$ of work can be extracted from the state $\rho_i$ by a thermodynamic
process at temperature $T$ and using the internal energy function $H$. Conversely, for given
temperature $T$ and Hamiltonian $H$, this is also the maximum amount of work extractable from
$\rho_i$ since, for any process leading to a final state $\rho'_f$ (not necessarily thermal for either $H$ or $T$),

$$-\Delta W' = -\Delta E' + \Delta Q' \le -\mathrm{tr}[H(\rho'_f - \rho_i)] + T[S(\rho'_f) - S(\rho_i)]$$
$$= F(\rho_i) - F(\rho'_f) = [F(\rho_i) - F(\rho_f)] - [F(\rho'_f) - F(\rho_f)]$$
$$= T D(\rho_i\|\rho_f) - T D(\rho'_f\|\rho_f) \le T D(\rho_i\|\rho_f) , \qquad (44)$$

where the last inequality amounts to the "thermodynamic inequality" [OP93], i.e. the fact that
the free energy $F(\rho)$ attains its minimum at the thermal state $\rho = \rho_f$ (uniquely for $T \ne 0$).
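Theorems 1 and 2 thus lower-bound the extractable work at constant temperature $T$:

$$(-\Delta W)_{\max} = T D(\rho_i\|\rho_f) \ge T M(-\Delta S, d) \ge \frac{T(\Delta S)^2}{3\log^2 d} = \frac{T(\Delta s)^2}{3\log^2 l} , \qquad (45)$$

where in the last step we have assumed a system of $N$ $l$-level particles, i.e. $d = l^N$ (cf. Remark
11), and defined the change in entropy density $\Delta s := \Delta S/N = (S(\rho_f) - S(\rho_i))/N$ [BR97].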



Inequality (45) seems quite unusual as its LHS is the "extensive" free energy difference or
extractable work whereas the RHS is an "intensive" quantity, given by the entropy density
and temperature; moreover, in the "thermodynamic limit" $N \to \infty$, the inequality is essentially
tight. The reason for this is that states attaining equality are of the form (10), which are strongly
correlated as discussed in Remark 11, such that one cannot speak of few-particle properties and
the designation "extensive" is not appropriate.

3.2 Information-theoretic applications



We have already outlined in Section 2.2.1 the meaning of $\sqrt{\mathrm{var}_\rho(\log\rho)}$ as the fluctuation in
codeword length of an optimal prefix code; for a source with $d$ distinct signals this fluctuation is
at most $\approx \frac{1}{2}\log d$ by Theorem 8. In the following, we discuss implications in information theory
of the lower bound on the relative entropy (Theorems 1 and 2).

3.2.1 Cost of wrong code, universal codes, and Shannon channel capacity 

For a source producing i.i.d. signals $i$ according to a classical probability distribution $\rho = \{p_i\}_{i=1}^{d}$,
Shannon's source compression theorem [Sha48, CT06] shows any prefix code with a $D$-ary al-
phabet to have an average length of at least $S(\rho)/\log D$ per encoded signal. The lower the
entropy of the signal distribution, the shorter on average the encoded message can be. This
length is in fact achievable - up to less than 1 alphabet symbol - by assigning codewords of
length $\lceil -\log_D p_i \rceil$ to the signals.

If one however wrongly assumes the signals $i$ to be distributed according to $\sigma = \{q_i\}_{i=1}^{d}$ and
constructs a code for this distribution, with codewords of length $\lceil -\log_D q_i \rceil$, then the average
code length $L_\sigma$ will be

$$L_\sigma = \sum_{i=1}^{d} p_i \lceil -\log_D q_i \rceil \ge -\frac{1}{\log D}\sum_{i=1}^{d} p_i \log q_i = \frac{S(\rho)}{\log D} + \frac{D(\rho\|\sigma)}{\log D} . \qquad (46)$$

The last term is the cost of the wrong code [CT06] beyond the optimal average code length
$S(\rho)/\log D$ that would be achievable if one knew the correct distribution.
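
The decomposition (46) can be checked on a small example. The sketch below uses Shannon code lengths $\lceil -\log_D q_i \rceil$ for a wrongly assumed distribution; the distributions `P` and `Q` are hypothetical, and `Q` is chosen non-dyadic so that the ceilings are insensitive to floating-point rounding.

```python
import math


def entropy(p):
    """Shannon entropy S(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def rel_entropy(p, q):
    """Relative entropy D(p||q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def code_lengths(q, D=2):
    """Shannon code lengths ceil(-log_D q_i) designed for the distribution q."""
    return [math.ceil(-math.log(qi) / math.log(D)) for qi in q]


def average_length(p, lengths):
    """Average codeword length when the signals actually follow p."""
    return sum(pi * li for pi, li in zip(p, lengths))


# Hypothetical true source P and wrongly assumed distribution Q, binary alphabet.
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.4, 0.3, 0.2, 0.1]
D = 2

lengths = code_lengths(Q, D)
L_wrong = average_length(P, lengths)
lower = (entropy(P) + rel_entropy(P, Q)) / math.log(D)
kraft = sum(D ** (-l) for l in lengths)

print("lengths:", lengths)
print("average length:", L_wrong, "lower bound (46):", lower, "Kraft sum:", kraft)
```

The average length lies within one alphabet symbol above the right-hand side of (46), and the Kraft sum confirms that a prefix code with these lengths exists.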

Theorems 1 and 2 give a lower bound on this penalty just in terms of the difference $\delta =
(S(\rho) - S(\sigma))/\log D$ between the supposed length $S(\sigma)/\log D$ and the optimal achievable length
$S(\rho)/\log D$:

$$\frac{D(\rho\|\sigma)}{\log D} \ge \frac{M(\delta\log D, d)}{\log D} \ge \frac{2\delta^2\log D}{\log^2(d-1)+4} , \qquad (47)$$

where the last inequality holds only for positive expected savings $\delta > 0$.

When the signals $i \in \{1,\ldots,d\}$ follow one of the distributions $\rho^\theta$, where the parameter
$\theta \in \{1,\ldots,m\}$ is not known, one may choose a coding distribution $\sigma$ ("universal code") that
minimizes the maximal occurring penalty or redundancy (see [CT06], Section 13.1):

$$R^* := \min_\sigma \max_\theta D(\rho^\theta\|\sigma) . \qquad (48)$$


The theorems from Section 2.1 give an easily computable lower bound on the quantity $R^*$: For
this, denote by $S_{\min}$ and $S_{\max}$ the minimal resp. maximal entropy $S(\rho^\theta)$ among the states
$\rho^\theta$. Then using the properties from Theorem 2 and Remark 5 we have:

$$R^* \ge \min_\sigma \max_\theta M(S(\rho^\theta) - S(\sigma), d) = \min_{S\in[0,\log d]} \max_\theta M(S(\rho^\theta) - S, d) \qquad (49)$$

$$= \min_{S} \max\{ M(S_{\max} - S, d),\, M(S_{\min} - S, d) \} \qquad (50)$$

$$\ge \min_{S} \max\{ M(S - S_{\max}, d),\, M(S_{\min} - S, d) \} \qquad (51)$$

$$\ge M\big(\tfrac{1}{2}(S_{\min} - S_{\max}), d\big) , \qquad (52)$$

where in the third line we assumed $M(|\Delta|, d) \ge M(-|\Delta|, d)$, which seems true numerically and
is consistent with all previous analytical results, but which we have not proved; as it at least
holds for the lower bounds (11), lower-bounding (52) with the help of (11) gives true statements.


The minimal redundancy $R^*$ from (48) equals the Shannon capacity $C(T)$ (measured in
nats) of the classical discrete memoryless channel $T : \theta \mapsto i$ that is defined by the transition
probabilities $T(i|\theta) := p_i^\theta$ (Theorem 13.1.1 in [CT06], originally due to [Gal79, Rya79]; the proof
uses a minimax theorem). This gives, with the same provisions as below Eq. (52):

Proposition 12 (Lower bound on the classical Shannon capacity). For a discrete memoryless
channel $T : \mathcal{X} \to \mathcal{Y}$, given by transition probabilities $T(y|x)$ and with finite output dimension
$|\mathcal{Y}| \ge 2$, the Shannon capacity $C(T)$ is lower-bounded as

$$C(T) \ge M\Big(\frac{S_{\min} - S_{\max}}{2}, |\mathcal{Y}|\Big) , \qquad (53)$$

where $S_{\max}$ and $S_{\min}$ denote the maximal and minimal entropies, respectively, of any column
$T(\cdot|x)$ of the transition matrix. Here, $M$ denotes the function defined in Eq. (6), which in turn
can be lower-bounded as in Theorem 2.



The RHS in Eq. (53), or one of its lower bounds from (11), is easier to evaluate than
Shannon's mutual information formula for the exact $C(T)$ [Sha48, CT06]. Lower-bounding the
relative entropies in (48) by Pinsker's or the H-O-T inequality [Csi67, HOT81, AE05] would
lead to a linear program in the variables $\sigma$. Eq. (53) also provides a more systematic way to
obtain lower bounds on $C(T)$ than by plugging trial input distributions into Shannon's formula.
On the other hand, the lower bound (53) will be trivial iff all columns $T(\cdot|x)$ have the same
entropy, whereas the capacity $C(T)$ vanishes only iff all columns are themselves identical. Also,
the lower bound in (53) can never exceed $\log 2 = 1$ bit, since it has to hold for input dimension
$|\mathcal{X}| = 2$ as well (or when there are only two distinct columns in $T(\cdot|x)$); in the most favorable
case $S_{\max} - S_{\min} = \log d$, the RHS of (53) is actually always between $0.111 \approx 0.16$ bit (for $d = 2$)
and $\log\sqrt{3} \approx 0.80$ bit (for $d \to \infty$; cf. Remark 4).

In the quantum setting, identical formulas apply for the cost of the wrong code (47) and the
redundancy (52), see [SW00, BW01]. Furthermore, the Holevo quantity, which is a lower bound
on the classical capacity of a quantum channel [NC00], equals the relative entropy radius of the
channel output, i.e. the redundancy (48) over all output states [OPW97, SW00]. For a quantum
channel, however, there is no systematic way known in particular to find the minimum output
entropy $S_{\min}$ efficiently; the channel output set has, e.g., generally infinitely many extreme
points.



3.2.2 Hypothesis testing and large deviations 

The relative entropy features prominently also in hypothesis testing and large deviation theory
[CT06]. On the one hand, relative entropies $D(\sigma\|\rho)$ between given states $\sigma$, $\rho$ appear for
example as error exponents in asymmetric hypothesis testing (in the classical Chernoff-Stein
Lemma [CT06] as well as in its quantum analogue [HP91, ON00]), such that Theorems 1 and 2
apply immediately to yield lower bounds on the error decay rate in terms of the entropy difference
$S(\sigma) - S(\rho)$ only.

On the other hand, in these areas one is often interested in quantities like

$$\mathrm{dist}(E,\rho) := \inf_{\sigma\in E} D(\sigma\|\rho) , \qquad (54)$$

where $E$ is some set of $d$-dimensional probability distributions and $\rho$ a fixed distribution. Some-
times the set $E$ is described by an entropy constraint, for example in universal coding for
all $d$-dimensional sources of entropy less than $R$ ([CT06]; similarly [BDK+05] for the quan-
tum case): here, the decoding error probability vanishes exponentially in the message length $n$
like $\sim \exp(-n\,\mathrm{dist}(E,\rho))$ if the true source distribution is $\rho$ (assuming $S(\rho) < R$) and where
$E := \{\sigma \,|\, S(\sigma) \ge R\}$. The decay rate, $\mathrm{dist}(E,\rho)$, may thus be lower-bounded by $M(R - S(\rho), d)$
according to Theorems 1 and 2 simply in terms of an entropy difference.

Finally, in symmetric hypothesis testing between two classical (commuting) probability dis-
tributions $\rho_1$, $\rho_2$, the optimal error decay rate is given by the Chernoff information $\xi(\rho_1,\rho_2) =
-\log\min_{0\le s\le 1} \mathrm{tr}[\rho_1^{s}\rho_2^{1-s}]$ [Che52, CT06], which has the property that there exists a distribution
$\sigma$ (from the Hellinger arc between $\rho_1$ and $\rho_2$) satisfying $\xi(\rho_1,\rho_2) = D(\sigma\|\rho_1) = D(\sigma\|\rho_2)$. Similar
to the derivation leading up to (52), the latter quantity can be lower-bounded in terms of the
entropy difference $\Delta(\rho_1,\rho_2) = S(\rho_1) - S(\rho_2)$ between the two states only:

$$\xi(\rho_1,\rho_2) \ge \frac{|\Delta(\rho_1,\rho_2)|^2}{2(\log^2(d-1)+4)} - \frac{|\Delta(\rho_1,\rho_2)|^3}{3(\log^2(d-1)+4)^2} , \qquad (55)$$

where the last expression does not involve any extremization (cf. Theorem 2).

Whereas for symmetric hypothesis testing between (non-commuting) quantum states $\rho_1$, $\rho_2$
the basic formula for the decay rate $\xi(\rho_1,\rho_2)$ holds as well, the existence of a state $\sigma$ as above
is not known [ANS+08]. We can therefore not apply the same reasoning to get a lower bound
on $\xi(\rho_1,\rho_2)$ in the quantum setting. For other kinds of (dimension-independent) bounds on the
quantum and classical Chernoff information, see [ANS+08, Aud12].

3.2.3 Mutual information 

Let $\rho_{AB}$ be a joint state on a bipartite system $AB$ with respective local dimensions $d_A$ and $d_B$
and total dimension $d = d_A d_B$ (in the classical probabilistic case, $\rho_{AB}$ is a joint probability
distribution of two random variables $A$ and $B$ with $d_A$ and $d_B$ outcomes, respectively). Then
its mutual information $I(A:B) := S(\rho_A) + S(\rho_B) - S(\rho_{AB})$ can be written as both a relative
entropy and an entropy difference [OP93]:

$$I(A:B) = S(\rho_A\otimes\rho_B) - S(\rho_{AB}) = -\Delta(\rho_{AB}, \rho_A\otimes\rho_B) \qquad (56)$$

$$= D(\rho_{AB}\|\rho_A\otimes\rho_B) , \qquad (57)$$

where $\rho_A$, $\rho_B$ denote the reduced states (marginal probability distributions) for $A$ and $B$, and
in the first line we used the notation $\Delta(\sigma,\rho) = S(\sigma) - S(\rho)$.

Here we just remark that Theorem 1, which relates relative entropy and entropy difference,
does not give any constraints in this situation: For $\Delta \in [-\log d, 0]$, which is the case here,
it is $-\Delta \ge M(\Delta,d)$ by Remark 5, with strict inequality except for $\Delta = -\log d, 0$; the fact
$D(\rho_{AB}\|\rho_A\otimes\rho_B) = -\Delta(\rho_{AB}, \rho_A\otimes\rho_B)$ is thus consistent with Theorem 1, and the theorem
therefore does not give new information.

Note that $I(A:B) \le \min\{\log d_A, \log d_B\}$ in the classical case, whereas for quantum states
$I(A:B) \le 2\min\{\log d_A, \log d_B\}$, so that the maximum value $\log d = \log d_A + \log d_B$ of
$-\Delta(\rho_{AB}, \rho_A\otimes\rho_B)$ and of $D(\rho_{AB}\|\rho_A\otimes\rho_B)$ can be attained only in the quantum case and
only when $d_A = d_B$ with a maximally entangled state $\rho_{AB}$ [NC00].
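
For classical joint distributions, the three expressions for the mutual information (definition, entropy difference (56), relative entropy (57)) can be compared directly. The following is a minimal sketch; the joint distribution `JOINT` and the function name are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy S(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mutual_information_three_ways(joint):
    """joint[a][b] is a classical joint distribution; returns the three expressions."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    flat = [x for row in joint for x in row]
    # Product of marginals, enumerated in the same (a, b) order as `flat`.
    product = [a * b for a in p_a for b in p_b]
    by_definition = entropy(p_a) + entropy(p_b) - entropy(flat)
    as_entropy_difference = entropy(product) - entropy(flat)
    as_relative_entropy = sum(
        x * math.log(x / y) for x, y in zip(flat, product) if x > 0
    )
    return by_definition, as_entropy_difference, as_relative_entropy


# Hypothetical correlated pair of bits (d_A = d_B = 2).
JOINT = [[0.4, 0.1], [0.05, 0.45]]
i_def, i_diff, i_rel = mutual_information_three_ways(JOINT)
print(i_def, i_diff, i_rel, "classical maximum:", math.log(2))
```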

4 Proofs 

4.1 Proof of Theorem 1

Proof of Theorem 1. To prove the inequality (9) and the optimality statement around (10), we
will compute, for any fixed $\Delta \in [-\log d, \log d]$, the infimum

$$\inf\{ D(\sigma\|\rho) \,|\, S(\sigma) - S(\rho) = \Delta \} \qquad (58)$$

over $d$-dimensional quantum states $\sigma$, $\rho$, and show that it equals $M(\Delta,d)$ from Eq. (6) with
optimal states $\sigma$, $\rho$ of the form (10).

We first note that the infimum in (58) is attained: For $\Delta = \log d$, one necessarily has
$\sigma = \mathbb{1}/d$ and $\rho$ is a pure state, so that $D(\sigma\|\rho) = \infty$ is "attained"; on the other hand, this equals
$M(\log d, d) = \infty$, as $\Delta = \log d$ in (6) enforces $s = (d-1)/d$ and $r = 0$; the case $\Delta = \log d$ is thus
done and we exclude it from all further considerations. For any $\Delta \in [-\log d, \log d)$, there exists
a full-rank state $\rho$ with $S(\sigma) - S(\rho) = \Delta$, and thus the intersection of the set of all pairs $(\sigma,\rho)$
satisfying $S(\sigma) - S(\rho) = \Delta$ with the set of all pairs satisfying $\mathrm{supp}[\sigma] \subseteq \mathrm{supp}[\rho]$ is non-empty
and compact, so that the infimum of the continuous function $D(\sigma\|\rho)$ over this set is attained
and finite. For similar reasons, the infimum in (6) is attained. For this reasoning and for the
argumentation below, we note that $H(s) + s\log(d-1)$ is strictly increasing in $s \in [0,(d-1)/d]$
from the value $0$ at $s = 0$ to $\log d$ at $s = (d-1)/d$ with first derivative

$$\frac{d}{ds}\big(H(s) + s\log(d-1)\big) = \log\Big(\frac{1-s}{s}(d-1)\Big) \quad \text{for } s \in (0,1). \qquad (59)$$


It is easy to see that the infimum in (58) is attained for commuting states $\sigma$ and $\rho$: Fixing
the state $\rho$ and fixing all eigenvalues $\mathrm{spec}(\sigma)$ of $\sigma$ (which also fixes the entropy $S(\sigma)$; this should
be done to be consistent with $S(\sigma) - S(\rho) = \Delta$ for the fixed $\Delta$), the infimum (over $\sigma$) of the
relative entropy

$$D(\sigma\|\rho) = -S(\sigma) + \mathrm{tr}[(-\log\rho)\sigma] \qquad (60)$$

is attained by the state $\sigma$ which is diagonal in the same basis as $(-\log\rho)$ and has its eigenvalues
ordered in the opposite way as $(-\log\rho)$ [Bha97]; as the logarithm is a strictly increasing function,
$\sigma$ will thus also be diagonal in the same basis as $\rho$ (and in particular commute with $\rho$), with its
eigenvalues ordered in the same way as $\rho$. (When $\mathrm{rank}(\rho) < \mathrm{rank}(\mathrm{spec}[\sigma])$, the infimum is $+\infty$,
and this as well can be attained by a $\sigma$ commuting with $\rho$.) This commutativity carries over to
the infimum in (58), and implies that the bound we are about to prove will be optimal for the
case of classical $d$-dimensional probability distributions (i.e. diagonal density matrices) as well.

One can get more information about the optimal pair $(\sigma,\rho)$ from Klein's inequality, i.e. the
non-negativity of the relative entropy. We fix again the state $\rho$ and fix the entropy of $\sigma$ to
equal $S(\sigma) = S$, leaving the spectrum of $\sigma$ otherwise free; under these constraints we again
minimize (60). In thermodynamics language (see Eq. (24) and below), this is the minimization
of the "energy" of $\sigma$ w.r.t. the "Hamiltonian" $(-\log\rho)$ under the entropy constraint $S(\sigma) = S$;
by the "thermodynamic inequality", a version of Klein's inequality, it is well-known that the
minimum is attained for a "thermal state" $\sigma \sim e^{-\gamma(-\log\rho)}$, i.e. $\sigma = \rho^\gamma/\mathrm{tr}[\rho^\gamma]$, for some "inverse
temperature" $\gamma \in [0,+\infty]$ (here we define $0^0 := 0$, and $\rho^\infty/\mathrm{tr}[\rho^\infty]$ is to be understood as the
maximally mixed state on the eigenspace of $\rho$ corresponding to its largest eigenvalue).



Making this argument more precise requires some care: We consider the minimization of (60)
under variation of both $\sigma$ and $\rho$ with the constraints of fixed $S(\rho) = S_\rho$ and fixed $S(\sigma) = S_\sigma =
S_\rho + \Delta$, and denote by $(\hat\sigma,\hat\rho)$ a minimizing assignment. Only in the case $S_\rho = 0$ can we have
$D(\hat\sigma\|\hat\rho) = +\infty$, and we do not consider this case here as it is only necessary if $\Delta = \log d$, which
was already discussed above. Thus $D(\hat\sigma\|\hat\rho) < \infty$, and so we have $\mathrm{supp}[\hat\sigma] \subseteq \mathrm{supp}[\hat\rho]$, which
implies $\log\mathrm{rank}(\hat\rho) \ge S_\sigma$. Now, if $S_\sigma = \log\mathrm{rank}(\hat\rho)$, then obviously $\hat\sigma = \hat\rho^0/\mathrm{tr}[\hat\rho^0]$ (i.e. $\hat\sigma$ is the
maximally mixed state on the support of $\hat\rho$; we define $0^0 := 0$). Second, if $\log\mathrm{rank}(\hat\rho) > S_\sigma >
\log m_0$, where $m_0$ denotes the dimension of the eigenspace of the largest eigenvalue of $\hat\rho$ (i.e. the
dimension of the ground state space of the "Hamiltonian" $(-\log\hat\rho)$), then due to continuity
of the entropy [Fan73] there exists $\gamma \in (0,\infty)$ with $S(\hat\rho^\gamma/\mathrm{tr}[\hat\rho^\gamma]) = S_\sigma$. We claim that then
$\hat\sigma = \hat\rho^\gamma/\mathrm{tr}[\hat\rho^\gamma]$ is the unique minimizer of (60) under variation of $\sigma$ (when keeping $\rho = \hat\rho$ fixed).
This is easy to see by verifying $\gamma\,(D(\tau\|\hat\rho) - D(\hat\rho^\gamma/\mathrm{tr}[\hat\rho^\gamma]\,\|\,\hat\rho)) = D(\tau\|\hat\rho^\gamma/\mathrm{tr}[\hat\rho^\gamma])$ for all states $\tau$
with $S(\tau) = S_\sigma$, and then using that $D(\tau\|\hat\rho^\gamma/\mathrm{tr}[\hat\rho^\gamma]) \ge 0$ with equality iff $\tau = \hat\rho^\gamma/\mathrm{tr}[\hat\rho^\gamma]$ (by
Klein's inequality). Third, if $S_\sigma = \log m_0$, then the maximally mixed state on the eigenspace of
the largest eigenvalue of $\hat\rho$ is obviously the unique state with entropy $S_\sigma$ and minimizing (60),
i.e. we could formally write $\hat\sigma = \hat\rho^\infty/\mathrm{tr}[\hat\rho^\infty]$. Fourth, if $S_\sigma < \log m_0$, then $\hat\sigma$ may be any state
supported on the eigenspace of the largest eigenvalue of $\hat\rho$. In all of these cases, $\hat\sigma$ and $\hat\rho$ commute,
which was already seen before by simpler reasoning.

For the following we can thus assume that

$$\hat\sigma = \mathrm{diag}(\hat q_1,\ldots,\hat q_d) \quad\text{and}\quad \hat\rho = \mathrm{diag}(\hat p_1,\ldots,\hat p_d) . \qquad (61)$$
Now fixing $\sigma = \hat\sigma$ in (60), the minimization over all commuting states $\rho = \mathrm{diag}(p_1,\ldots,p_d)$ leads
to the Lagrange function

$$L(\{p_i\},\nu,\mu) := \sum_i (\hat q_i\log\hat q_i - \hat q_i\log p_i) + \nu\sum_i p_i + \mu\sum_i p_i\log p_i \qquad (62)$$

with Lagrange multipliers $\nu$ and $\mu$ corresponding to the normalization and entropy constraints
$\mathrm{tr}[\rho] = 1$ and $S(\rho) = S_\rho$, respectively. We now look at this as a function of those variables $p_i$ for
which the corresponding element $\hat p_i \ne 0$ is positive (i.e. which lie in the interior of the domain of
$L$), and we fix the other elements $p_i$ to be zero. Then, since $p_i = \hat p_i$ is a minimizing assignment,
by the method of Lagrange multipliers one is guaranteed the existence of $\hat\nu, \hat\mu \in (-\infty,+\infty)$ such
that

$$\frac{\partial L}{\partial p_j}\Big|_{\{\hat p_i\},\hat\nu,\hat\mu} = -\frac{\hat q_j}{\hat p_j} + (\hat\nu + \hat\mu) + \hat\mu\log\hat p_j = 0 \quad \forall j \text{ with } \hat p_j \ne 0 . \qquad (63)$$

This excludes that the fourth case from the previous paragraph can be a minimizing case, since
in this case there are $\hat p_j = \hat p_k = \lambda_{\max}(\hat\rho) > 0$ and $\hat q_j \ne \hat q_k$, contradicting (63). Within the
third case of the previous paragraph, it excludes the possibility that, apart from the maximum
eigenvalue $\lambda_{\max}(\hat\rho)$, there could be two further distinct non-zero eigenvalues $\hat p_i \ne \hat p_j$, as in the
third case both of these would have corresponding $\hat q_i = \hat q_j = 0$, again contradicting (63). Thus,
in the third case above, $\hat\rho$ has at most two distinct non-zero eigenvalues, as does $\hat\sigma$.

Also for the first and second cases of the above paragraph we now want to show that, except
possibly when $\gamma = 1$ (i.e. for $\hat\sigma = \hat\rho$ or $\Delta = 0$), $\hat\rho$ has at most two distinct non-zero eigenvalues,
and $\hat\sigma$ as well. In these two cases, we have $\hat q_j = \hat p_j^{\gamma}/Z$ for some $\gamma \in [0,\infty)$ with $Z := \sum_i \hat p_i^{\gamma} > 0$.
Now define $x_j := Z\hat q_j/\hat p_j = \hat p_j^{\gamma-1}$ for each $j$ with $\hat p_j > 0$. Eq. (63) says then that, for $\gamma \ne 1$,
the points $x_j$ lie at intersections of the non-horizontal affine function $-x/Z + (\hat\nu + \hat\mu)$ with the
function $-(\hat\mu/(\gamma-1))\log x$ (both are functions of $x > 0$). The latter function is either strictly
convex or strictly concave or constant (depending on whether the prefactor is negative or positive
or zero). The two functions can thus not intersect at more than 2 distinct points $x_j > 0$. When
$\gamma \ne 1$, there can therefore be at most 2 distinct non-zero values of $x_j$, i.e. also at most 2 distinct
non-zero values of $\hat p_j$ and of $\hat q_j$.



Summing up so far, any states $\hat\rho$ and $\hat\sigma$ attaining the infimum in (58) commute and, for $\Delta \ne 0$,
have at most two distinct non-zero eigenvalues each, in such a way that distinct eigenvalues in
$\hat\sigma$ and in $\hat\rho$ correspond to each other. More precisely,

$$\hat\sigma = \mathrm{diag}\Big(\underbrace{\tfrac{1-s}{m},\ldots,\tfrac{1-s}{m}}_{m},\ \underbrace{\tfrac{s}{n},\ldots,\tfrac{s}{n}}_{n},\ 0,\ldots,0\Big) , \qquad \hat\rho = \mathrm{diag}\Big(\underbrace{\tfrac{1-r}{m},\ldots,\tfrac{1-r}{m}}_{m},\ \underbrace{\tfrac{r}{n},\ldots,\tfrac{r}{n}}_{n},\ 0,\ldots,0\Big) , \qquad (64)$$

where $m, n \ge 1$, $m + n \le d$ and $s, r \in [0,1]$. Permuting the entries of both states simultaneously,
we may assume the entries of $\hat\sigma$ to be ordered non-increasingly, i.e. $(1-s)/m \ge s/n$. The above
analysis showed further that the diagonal entries of a minimizing pair are ordered in the same
order (see below Eq. (60); this also can be seen by the fact that the inverse temperature $\gamma$ above
turned out to be always non-negative). Thus, $(1-r)/m \ge r/n$ as well, and we will therefore
in the following always assume $0 \le s, r \le n/(m+n)$. Even in the case $\Delta = 0$, some of the
minimizing pairs $(\hat\sigma,\hat\rho)$ have this form (choose any $m$, $n$, and $s = r$), and we thus assume this
form below; similarly for the case $\Delta = \log d$, where e.g. $m = 1$, $n = d-1$, $s = (d-1)/d$, $r = 0$.


We can thus continue the optimization in (58) with states of the form (64). Before that, note
for the states in (64):

$$S(\hat\sigma) = H(s) + (1-s)\log m + s\log n , \qquad S(\hat\rho) = H(r) + (1-r)\log m + r\log n , \qquad (65)$$

$$S(\hat\sigma) - S(\hat\rho) = H(s) - H(r) + (s-r)\log\frac{n}{m} , \qquad (66)$$

$$D(\hat\sigma\|\hat\rho) = D_2(s\|r) = s\log\frac{s}{r} + (1-s)\log\frac{1-s}{1-r} . \qquad (67)$$


Given $\Delta \ne 0$, let now the states $\hat\sigma$ and $\hat\rho$ in (64), parametrized by $s$, $r$, $m$, and $n$, attain the
infimum in (58). Our next goal is to show $m = 1$ and $n = d-1$. For now, we will denote by
$\tau_{t,m,n}$ the state parametrized by $t$, $m$, and $n$, such that, for example, $\tau_{s,m,n} = \hat\sigma$ and $\tau_{r,m,n} = \hat\rho$
in (64). Assume that there exist $m', n' \ge 1$ with $m' + n' \le d$ and $n'/m' > n/m$. We will then
show that there exists some $s'$, such that the pair of states $(\tau_{s',m',n'}, \tau_{r,m',n'})$ would achieve a
strictly lower value in (58) than the pair $(\hat\sigma,\hat\rho)$. For this, compute

$$S(\tau_{s,m',n'}) - S(\tau_{r,m',n'}) = H(s) - H(r) + (s-r)\log\frac{n'}{m'} = \Delta + (s-r)\log\frac{n'/m'}{n/m} , \qquad (68)$$

and note that the last logarithm is positive due to $n'/m' > n/m$. Now, assume first $\Delta > 0$. Then,
from (66), we have $s > r$ due to our convention $s, r \le n/(m+n)$. Thus the expression (68) is
strictly larger than $\Delta$, and because its left-hand side is an increasing function of the argument
$s \le n/(m+n)$ (similar to the computation (59)), there exists due to continuity some $s' \in (r,s)$
with

$$S(\tau_{s',m',n'}) - S(\tau_{r,m',n'}) = \Delta . \qquad (69)$$

Since $D_2(s\|r)$ is strictly increasing in its first argument for $s \ge r$ (assuming $r > 0$, which holds
due to $\Delta < \log d$), we have $D_2(s'\|r) < D_2(s\|r)$, which contradicts the optimality of the pair
$(\hat\sigma,\hat\rho)$. The case $\Delta < 0$ is analogous. We have thus shown that, if we choose the parametrization
of the optimal pair in (64) such that $s \le n/(m+n)$, then there do not exist $m', n' \ge 1$ with
$m' + n' \le d$ and $n'/m' > n/m$. This implies $n = d-1$, $m = 1$ for the optimal pair $(\hat\sigma,\hat\rho)$.



Using now $n = d-1$, $m = 1$ in (64) and recalling (65)-(67), the optimal states (for $\Delta \ne 0$)
will thus be of the form (10), where $(s,r)$ attains the minimum in (6); for $\Delta = 0$, the optimal
states can be chosen to be of that form. The preceding proof shows also that, for $\Delta \ne 0$, the
optimal states are necessarily of the form (10), up to simultaneous unitary transformations of $\sigma$
and $\rho$; the proof in Section 4.2 shows furthermore that, for each $\Delta \ne 0$, the optimal $s$ and $r$ are
unique. For $\Delta = 0$, the optimal pairs are obviously exactly the ones with $\sigma = \rho$. □



4.2 Proof of Theorem 2

Proof of Theorem 2. $M(\Delta,d) \ge 0$ is clear, and the stated values are argued below Eq. (5).

For $N = N(d)$, the first inequality in (11) is just Lemma 13, and for $N > N(d)$ it follows
from the monotonicity of the lower bound:

$$\frac{d}{dN}\Big(N e^{\Delta/N} - N - \Delta\Big) = -e^{\Delta/N}\Big[e^{-\Delta/N} - 1 + \frac{\Delta}{N}\Big] \le 0 , \qquad (70)$$

since the expression in square brackets is non-negative due to convexity of the exponential function. For any
$N$ and $\Delta$, the second inequality in (11) is easily verified by subtracting both sides from each
other and observing that the difference and its first three derivatives w.r.t. $\Delta$ vanish at $\Delta = 0$,
whereas the fourth derivative is positive everywhere. If one defines, as usual, the minimum over
an empty set in (6) to be $\infty$, then the lower bounds (11) hold even for $\Delta$ outside the range
$[-\log d, \log d]$.

To prove (12), we use the left inequality in (11) and find a constant $c > 0$ satisfying
$N(d)e^{\Delta/N(d)} - N(d) - \Delta \ge \Delta^2/(c\log^2 d)$ for all $d \ge 2$ and $\Delta \in [-\log d, \log d]$. At fixed $d$,
this inequality holds for all $\Delta$ iff it holds for $\Delta = -\log d$. For large $d \to \infty$, one sees by Tay-
lor expansion and by $N(d) = N_d + O(1) = (\log^2 d)/4 + O(1)$ that any $c > 1/2$ is sufficient.
Examining small $d \ge 2$ numerically, one sees that $c = 3$ is sufficient.

For the convenient upper bounds on $N_d$, see Lemma 14.


We now sketch a proof of strict convexity (and continuous differentiability) of $M(\Delta,d)$,
which is somewhat involved. For this, we employ its definition (6), will sometimes abbreviate
$D := \log(d-1) \ge 0$, and denote by $r_d$ the (unique) $r \in (0,1/2)$ attaining the maximum in (7),
i.e. satisfying $(1-2r_d)\log\big(\frac{1-r_d}{r_d}(d-1)\big) = 2$. We also define $\gamma_d \in (0,(d-1)/d)$ to be the unique
solution of $(1-\gamma_d)\log\big(\frac{1-\gamma_d}{\gamma_d}(d-1)\big) = 1$; one can check $\gamma_d > r_d$.

If, for some $\Delta = x \in (-\log d, \log d)$, a pair $(s,r) \in (0,(d-1)/d)^2$ attains the minimum in
(6), then by the method of Lagrange multipliers the following two equations hold:

$$\Delta(s,r) := H(s) - H(r) + (s-r)D = x , \qquad (71)$$

$$F(s,r) := \Big(\log\frac{1-r}{r} - \log\frac{1-s}{s}\Big)\Big(D + \log\frac{1-r}{r}\Big) - \Big(\frac{s}{r} - \frac{1-s}{1-r}\Big)\Big(D + \log\frac{1-s}{s}\Big) = 0 , \qquad (72)$$

where the latter equality expresses the requirement that the gradients of the target function and
the constraint function be parallel (i.e., that the $2\times 2$-matrix formed by these gradients have
vanishing determinant). In a small enough neighborhood of any such pair $(s,r) \in (0,(d-1)/d)^2$
with $s \ne r$, the equations (71)-(72) are sufficiently well-behaved to have a unique solution
$(s(x'),r(x'))$ for any $x' \in (x-\epsilon, x+\epsilon)$, as the solution of the differential equations obtained
from (71)-(72). For any $s = r$, (71)-(72) are satisfied with $x = 0$ (corresponding to the trivial
optimality cases $\sigma = \rho$), but near any such point there are no other pairs with $F(s,r) = 0$ and
$s \ne r$ (as one sees from a quadratic expansion of $F(s,r)$) with the exception of $s = r = r_d$: around
$x = 0$ and $s = r = r_d$, the equations (71)-(72) have a solution with $\dot s(x=0) = (1-2r_d)/3$,
$\dot r(x=0) = -(1-2r_d)/6$ (overdots denote derivatives w.r.t. $x$), which can be seen by computing
third directional derivatives of $F(s,r)$ at this point.

Examining the equation $F(s,r) = 0$ for $(s,r) \in (0,(d-1)/d)^2$ (by way of discussing $F(s,r)$
and its derivative $F_s(s,r)$ along each fixed $r$) and furthermore considering optimal pairs $(s,r)$ for
any $\Delta = x$ in (6) on the boundary of $[0,(d-1)/d]^2$, one finds the following: For $r = 0$, optimal
pairs are obtained for $s = 0$ and for $s = (d-1)/d$ (where $x = \log d$); for $0 < r < r_d$, optimal
pairs are obtained for $s = r$ and for one other value $s \in (r_d,(d-1)/d)$ (where $0 < x < \log d$);
for $r = r_d$, the only optimal pair is obtained for $s = r_d$ (where $x = 0$); for $r_d < r < \gamma_d$, optimal
pairs are obtained for one value $s \in (0,r_d)$ (where $x \in (\Delta_\gamma, 0)$, where we define $\Delta_\gamma := \Delta(s =
0, r = \gamma_d) = 1 - D + \log\gamma_d \in (-\log d, 1-\log d)$) and for $s = r$; for $\gamma_d \le r \le (d-1)/d$, optimal
pairs are obtained for $s = 0$ (where $x \in [-\log d, \Delta_\gamma]$) and for $s = r$.

Combining this with the above differentiability result and defining $s(0) := r(0) := r_d$ for
$x = 0$ while disregarding the other optimal pairs with $s = r$, we get the following: For any $x \in
[-\log d, \log d] \setminus \{0\}$ there exists exactly one optimal pair $(s(x),r(x))$ (i.e. with $\Delta(s(x),r(x)) = x$),
the curve $(s(x),r(x))$ is continuous in $x \in [-\log d, \log d]$, and differentiable in $x \in (\Delta_\gamma, \log d)$.
Thus already, $M(x,d) = D_2(s(x)\|r(x))$ is continuous in $x \in [-\log d, \log d]$ (with the usual
convention $\lim_{x\nearrow\log d} M(x,d) = \infty = M(\log d, d)$).

We can now finally prove strict convexity of $M(x,d)$. First, for $x \in [-\log d, \Delta_\gamma]$, it is $s(x) = 0$.
One can thus explicitly write $\Delta = -H(r) - Dr$ as a function of $M = M(x,d) = D_2(s=0\|r) =
-\log(1-r)$ in this range of $\Delta = x$; the function $\Delta = \Delta(M)$ is easily seen to be continuously
differentiable, strictly decreasing and strictly convex in this range. Its inverse $M = M(\Delta,d)$
is thus strictly convex as well and continuously differentiable in $\Delta \in (-\log d, \Delta_\gamma]$, and one can
compute $dM/d\Delta\,|_{\Delta=\Delta_\gamma} = -1$ (and $dM/d\Delta\,|_{\Delta\searrow -\log d} = -\infty$).



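Second, for $x \in (\Delta_\pi, \log d)$, the optimal pairs $(s(x), r(x)) \in (0, (d-1)/d)^2$ satisfy (71)-(72). We can thus compute

$$\begin{aligned}
\frac{d}{dx} M(x,d) &= \frac{d}{dx} D_2(s(x)\|r(x)) && (73)\\
&= \left(\log\frac{1-r(x)}{r(x)} - \log\frac{1-s(x)}{s(x)}\right)\dot s(x) - \left(\frac{s(x)}{r(x)} - \frac{1-s(x)}{1-r(x)}\right)\dot r(x) && (74)\\
&= \left(\log\frac{1-r(x)}{r(x)} - \log\frac{1-s(x)}{s(x)}\right)\left(D + \log\frac{1-s(x)}{s(x)}\right)^{-1}, && (75)
\end{aligned}$$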



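where in the last step we used (72) and the derivative of (71) w.r.t. $x$. Notice for later that $dM(x,d)/dx\,|_{x \searrow \Delta_\pi} = -1$, since $s(x) \searrow 0$ and $r(x) \to \pi_d$ for $x \searrow \Delta_\pi$. Thus,

$$\left(D + \log\frac{1-s(x)}{s(x)}\right)^2 \frac{d^2}{dx^2} M(x,d) = \left(D + \log\frac{1-r(x)}{r(x)}\right)\frac{\dot s(x)}{s(x)(1-s(x))} - \left(D + \log\frac{1-s(x)}{s(x)}\right)\frac{\dot r(x)}{r(x)(1-r(x))}. \tag{76}$$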

Strict convexity, $d^2 M(x,d)/dx^2 > 0$, would thus follow from $\dot s(x) \ge 0$ and $\dot r(x) \le 0$; to see the last implication, note that $\dot s(x)$ and $\dot r(x)$ cannot both vanish simultaneously because of $d\Delta(s(x), r(x))/dx = 1 > 0$. The last insight also shows that $\dot s(x) \le 0$ and $\dot r(x) \ge 0$ cannot both be true simultaneously unless $\dot s(x) = \dot r(x) = 0$, which was just excluded. It thus suffices now to show that $\dot s(x)$ and $\dot r(x)$ can neither both be positive nor both be negative. For $x = 0$, this was remarked above. For $x \in (\Delta_\pi, \log d) \setminus \{0\}$, we show it in the following way.



Differentiating (72), one has

$$0 = \frac{d}{dx} F(s(x), r(x)) = F_s(s(x), r(x))\,\dot s(x) + F_r(s(x), r(x))\,\dot r(x). \tag{77}$$

The considerations of the equation $F(s,r) = 0$ above show that $F_s(s(x), r(x)) > 0$ for $s(x) \neq r(x)$. Finally, the fact that $s(x) > r(x)$ implies $r(x) < r_d$ and the fact that $s(x) < r(x)$ implies $r(x) > r_d$ (see above) can be used, together with (72), to show $F_r(s(x), r(x)) > 0$ for $s(x) \neq r(x)$. Eq. (77) then implies that $\dot s(x)$ and $\dot r(x)$ cannot have the same sign.

$M(x,d)$ is thus strictly convex in $x \in (\Delta_\pi, \log d)$, as well as in $x \in [-\log d, \Delta_\pi]$. Since $M(x,d)$ is continuous with matching left-sided and right-sided derivatives at $x = \Delta_\pi$ (see above), it is strictly convex in the whole range $x \in [-\log d, \log d]$. Continuity of $(s(x), r(x))$ and Eq. (75), together with the above considerations of the range $x \in [-\log d, \Delta_\pi]$, finally prove continuous differentiability of $M(x,d)$ in $x \in (-\log d, \log d)$.

□

4.3 Auxiliary Lemmas 

Lemma 13 (Simple lower bound on $M(\Delta, d)$). For $2 \le d < \infty$ and $\Delta \in [-\log d, \log d]$, the quantity $M(\Delta, d)$ is lower-bounded as follows:

$$M(\Delta, d) \ge N(d)\left(e^{\Delta/N(d)} - 1 - \frac{\Delta}{N(d)}\right), \tag{78}$$

where $N(d)$ is defined in Eq. (7).
Proof. Define the function $\Delta(s,r) := H(s) - H(r) + (s-r)\log(d-1)$. To show Lemma 13 we will prove

$$G(s,r) := D_2(s\|r) - N(d)\left[e^{\Delta(s,r)/N(d)} - 1 - \frac{\Delta(s,r)}{N(d)}\right] \ge 0 \tag{79}$$

for all $s, r \in [0, (d-1)/d]$. The statement is easily verified for $r = 0$, since $D_2(s\|0) = +\infty$ unless $s = 0$. We thus fix $r \in (0, (d-1)/d]$ from now on, so that $G(s,r)$ is a function of $s \in [0, (d-1)/d]$.
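At $s = r$, the function $G(s=r, r) = 0$ vanishes, as does its first derivative

$$\frac{\partial}{\partial s} G(s,r) = \log\frac{1-r}{r} - \log\frac{1-s}{s} - \left(e^{\Delta(s,r)/N(d)} - 1\right)\log\left(\frac{1-s}{s}(d-1)\right). \tag{80}$$

Furthermore, $G(s,r)$ is convex in $s \in [0, (d-1)/d]$ since, for $s \in (0, (d-1)/d]$,

$$\frac{\partial^2}{\partial s^2} G(s,r) = \frac{e^{\Delta(s,r)/N(d)}}{N(d)\,s(1-s)}\left[N(d) - s(1-s)\left(\log\left(\frac{1-s}{s}(d-1)\right)\right)^2\right] \ge 0, \tag{81}$$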



as the term in square brackets is non-negative due to the definition of $N(d)$ in Eq. (7).

All of this together shows that, for each fixed $r \in (0, (d-1)/d]$, $G(s,r)$ attains its minimum at $s = r$, which finally proves (79). □
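The inequality (79) can also be spot-checked numerically. The following is a minimal sketch in Python, not part of the proof: it assumes natural logarithms, takes $N(d)$ to be the maximum over $r$ of $r(1-r)\left(\log\left(\frac{1-r}{r}(d-1)\right)\right)^2$ (the expression appearing in the square bracket of (81)), and locates the maximizer by bisection on the stationarity condition quoted in the proof of Theorem 8. All function names (`n_of_d`, `h2`, `d2`, `delta`, `g`, `min_g_on_grid`) are hypothetical helpers introduced only for this illustration.

```python
import math


def n_of_d(d, iters=200):
    """N(d): maximum over r of r(1-r) * log((1-r)/r * (d-1))**2.

    The maximizer is found by bisection on the stationarity condition
    (1-2r) * log((1-r)/r * (d-1)) = 2, whose left side decreases on (0, 1/2).
    """
    lo, hi = 1e-15, 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (1 - 2 * mid) * math.log((1 - mid) / mid * (d - 1)) > 2:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    return r * (1 - r) * math.log((1 - r) / r * (d - 1)) ** 2


def h2(p):
    """Binary entropy H(p) in nats, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def d2(s, r):
    """Binary relative entropy D_2(s||r) for 0 < r < 1 and 0 <= s <= 1."""
    total = 0.0
    if s > 0:
        total += s * math.log(s / r)
    if s < 1:
        total += (1 - s) * math.log((1 - s) / (1 - r))
    return total


def delta(s, r, d):
    """Delta(s, r) = H(s) - H(r) + (s - r) log(d - 1)."""
    return h2(s) - h2(r) + (s - r) * math.log(d - 1)


def g(s, r, d, n):
    """G(s, r) from Eq. (79), with n = N(d)."""
    t = delta(s, r, d) / n
    return d2(s, r) - n * (math.expm1(t) - t)


def min_g_on_grid(d, steps=200):
    """Smallest value of G on a grid over s in [0,(d-1)/d], r in (0,(d-1)/d]."""
    n = n_of_d(d)
    top = (d - 1) / d
    grid = [top * k / steps for k in range(steps + 1)]
    return min(g(s, r, d, n) for r in grid[1:] for s in grid)


if __name__ == "__main__":
    for d in (2, 3, 10):
        print(d, min_g_on_grid(d))
```

Up to floating-point rounding, the grid minimum should be zero and attained on the diagonal $s = r$, in line with the convexity argument above. A grid check of this kind is of course no substitute for the analytic proof.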



Lemma 14 (Simple bounds on $N(d)$). For $d \ge 2$, the optimization $N(d)$ from Eq. (7) is bounded in the following ways:

$$\frac{1}{4}\log^2(d-1) < N(d) < N_d = \frac{1}{4}\log^2(d-1) + 1, \tag{82}$$

$$N(d) < \log^2 d, \tag{83}$$

where $N_d$ in the first inequality was defined earlier.



Proof. To prove the upper bound in (82), we show that for all $r \in [0,1]$

$$0 < \frac{1}{4}\log^2(d-1) + 1 - r(1-r)\left(\log\left(\frac{1-r}{r}(d-1)\right)\right)^2. \tag{84}$$



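For $r = 0, 1$ this is clear due to the convention $0 \cdot \infty = 0$ (or by continuity), and for $r = 1/2$ it is easily verified. Let thus $r \in (0,1) \setminus \{1/2\}$. The right-hand side of (84) equals

$$\begin{aligned}
&\left(\tfrac{1}{2} - r\right)^2 \log^2(d-1) - 2r(1-r)\log\frac{1-r}{r}\,\log(d-1) + 1 - r(1-r)\left(\log\frac{1-r}{r}\right)^2 \\
&\quad = \left(\left(\tfrac{1}{2} - r\right)\log(d-1) - \frac{r(1-r)}{\tfrac{1}{2} - r}\log\frac{1-r}{r}\right)^2 + 1 - r(1-r)\left(\log\frac{1-r}{r}\right)^2 - \frac{r^2(1-r)^2}{\left(\tfrac{1}{2} - r\right)^2}\left(\log\frac{1-r}{r}\right)^2 \\
&\quad \ge \frac{1}{(1-2r)^2}\left[(1-2r)^2 - r(1-r)\left(\log\frac{1-r}{r}\right)^2\right],
\end{aligned} \tag{85}$$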



where the inequality arises by omitting the non-negative first term $(\ldots)^2$ from the step before. Now, the last expression does not depend on the dimension $d$ anymore, and one can show that it is positive for all $r \in (0,1) \setminus \{1/2\}$. This is easily verified numerically, or analytically in the following way: The term in square brackets in (85) vanishes at $r = 1/2$, as do its first three derivatives w.r.t. $r$, whereas its fourth derivative

$$\frac{\partial^4}{\partial r^4}\Big[\ldots\Big]_{\text{from (85)}} = \frac{2 + 4(1-2r)\log\frac{1-r}{r}}{r^3(1-r)^3}$$

is strictly positive for all $r \in (0,1)$, since $(1-2r)\log\frac{1-r}{r} \ge 0$ for $r \in (0,1)$.

The lower bound in (82) follows by letting $r \to 1/2$ in the definition (7) of $N(d)$.

To prove (83), we look for a constant $c$ satisfying $N(d) < c\log^2 d$. For large $d \to \infty$ any $c > 1/4$ works by (82). Examining small $d \ge 2$ numerically, one finds that $c \ge 0.92$ suffices. □
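The numerical statements in this proof are easy to reproduce. The sketch below, again in Python with natural logarithms, is only an illustration under the assumption that $N(d)$ is the maximum over $r$ of $r(1-r)\left(\log\left(\frac{1-r}{r}(d-1)\right)\right)^2$; the helper names (`n_of_d`, `bracket_85`, `check_lemma_14`, `worst_ratio`) and the tested range of $d$ are choices made here, not taken from the paper.

```python
import math


def n_of_d(d, iters=200):
    """N(d), computed by bisection on (1-2r) * log((1-r)/r * (d-1)) = 2."""
    lo, hi = 1e-15, 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (1 - 2 * mid) * math.log((1 - mid) / mid * (d - 1)) > 2:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    return r * (1 - r) * math.log((1 - r) / r * (d - 1)) ** 2


def bracket_85(r):
    """Term in square brackets in (85): (1-2r)^2 - r(1-r) log^2((1-r)/r)."""
    return (1 - 2 * r) ** 2 - r * (1 - r) * math.log((1 - r) / r) ** 2


def check_lemma_14(d):
    """True if both (82) and (83) hold numerically for this d."""
    n = n_of_d(d)
    lower = 0.25 * math.log(d - 1) ** 2
    upper = lower + 1.0  # this is N_d
    return lower < n < upper and n < math.log(d) ** 2


def worst_ratio(d_max=200):
    """Largest value of N(d) / log^2(d) over 2 <= d <= d_max."""
    return max(n_of_d(d) / math.log(d) ** 2 for d in range(2, d_max + 1))


if __name__ == "__main__":
    print(all(check_lemma_14(d) for d in range(2, 200)))
    print(worst_ratio())
    print(all(bracket_85(k / 1000) > 0 for k in range(1, 1000) if k != 500))
```

In this sketch the largest ratio $N(d)/\log^2 d$ occurs at $d = 2$ and lies slightly below 0.92, which is where the constant in the last step of the proof comes from.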



4.4 Proof of Theorem 8



Proof of Theorem 8. For fixed $d \ge 2$, we maximize the expression on the LHS of (20) or (19) over all probability distributions $\{p_i\}$ (i.e., spectra of $\rho$), which leads to the Lagrange function

$$L(\{p_i\}, \nu) = \sum_i p_i (\log p_i)^2 - \left(\sum_i p_i \log p_i\right)^2 + \nu \sum_i p_i, \tag{86}$$



with the Lagrange multiplier $\nu$ corresponding to the normalization $\operatorname{tr}[\rho] = 1$. Assume now that $\{\hat p_i\}$ (corresponding to the state $\hat\rho$) attains the maximum of (19) over all probability distributions $\{p_i\}$ (due to continuity and compactness, this maximum is attained). We now view (86) as a function of those variables $p_i$ for which $\hat p_i > 0$, fixing the other elements $p_i$ to be zero. Then, since $\{\hat p_i\}$ is extremal and has its components in the interior of the domain of $L$, the method of Lagrange multipliers guarantees the existence of $\hat\nu \in (-\infty, +\infty)$ such that

$$\begin{aligned}
0 = \frac{\partial L}{\partial p_j}\bigg|_{\{\hat p_i\}, \hat\nu} &= (\log \hat p_j)^2 + 2\log \hat p_j - 2\left(\sum_i \hat p_i \log \hat p_i\right)(1 + \log \hat p_j) + \hat\nu \\
&= \left(S(\{\hat p_i\}) + 1 + \log \hat p_j\right)^2 - \left(S(\{\hat p_i\})\right)^2 + \hat\nu - 1 \qquad \forall j \text{ with } \hat p_j > 0,
\end{aligned} \tag{87}$$



where the quantity $S(\{\hat p_i\}) = S(\hat\rho)$ denotes the entropy of the distribution $\{\hat p_i\}$ and in particular does not depend on the index $j$. Thus, the equality (87) implies that

$$\log \hat p_j = \pm\sqrt{\left(S(\hat\rho)\right)^2 - \hat\nu + 1} - S(\hat\rho) - 1 \qquad \forall j \text{ with } \hat p_j > 0, \tag{88}$$

so that strict monotonicity of the logarithm yields that there can be at most two distinct non-zero elements in $\{\hat p_i\}$.

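Thus, leaving off hats again, an optimal $\rho = \hat\rho$ has the form

$$\rho = \operatorname{diag}\Big(\underbrace{\tfrac{1-r}{m}, \ldots, \tfrac{1-r}{m}}_{m},\ \underbrace{\tfrac{r}{n}, \ldots, \tfrac{r}{n}}_{n},\ 0, \ldots, 0\Big) \tag{89}$$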



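with $m, n \ge 1$, $m + n \le d$, $r \in [0,1]$. W.l.o.g. we can assume $r \le 1/2$ by permuting the entries of $\rho$. For such states one has, after a small calculation,

$$\sum_i p_i (\log p_i)^2 - \left(\sum_i p_i \log p_i\right)^2 = r(1-r)\left(\log\frac{1-r}{r} + \log\frac{n}{m}\right)^2. \tag{90}$$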

Maximizing this, for any fixed $r \in [0, 1/2]$, over $m$ and $n$ yields $n = d-1$ and $m = 1$. Maximizing (90) finally over $r$ gives a unique $r = r_d \in (0, 1/2)$, namely the unique value of $r \in [0, 1/2]$ satisfying $(1-2r)\log\left(\frac{1-r}{r}(d-1)\right) = 2$, and the maximum of (90) is $N(d)$ from (7).

The inequality (21) is shown by Lemma 14, which completes the proof of Theorem 8. □
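The structure of this proof suggests a simple numerical illustration: the two-level spectrum $(1-r_d, r_d/(d-1), \ldots, r_d/(d-1))$ should attain the surprisal variance $N(d)$, and no other spectrum should exceed it. The Python sketch below checks this on randomly drawn distributions. It assumes natural logarithms; the helper names (`r_opt`, `n_of_d`, `surprisal_variance`, `two_level_state`, `max_variance_over_samples`), the sampling scheme (normalized exponential variates) and the sample sizes are choices made for this example only.

```python
import math
import random


def r_opt(d, iters=200):
    """r_d: root of (1-2r) * log((1-r)/r * (d-1)) = 2 in (0, 1/2), by bisection."""
    lo, hi = 1e-15, 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (1 - 2 * mid) * math.log((1 - mid) / mid * (d - 1)) > 2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def n_of_d(d):
    """N(d) = r_d (1 - r_d) * log((1 - r_d)/r_d * (d-1))**2."""
    r = r_opt(d)
    return r * (1 - r) * math.log((1 - r) / r * (d - 1)) ** 2


def surprisal_variance(p):
    """sum_i p_i (log p_i)^2 - (sum_i p_i log p_i)^2, skipping zero entries."""
    first = sum(x * math.log(x) for x in p if x > 0)
    second = sum(x * math.log(x) ** 2 for x in p if x > 0)
    return second - first ** 2


def two_level_state(d, r):
    """Spectrum of the form (89) with m = 1 and n = d - 1."""
    return [1 - r] + [r / (d - 1)] * (d - 1)


def max_variance_over_samples(d, samples, seed=0):
    """Largest surprisal variance among randomly drawn d-point distributions."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(samples):
        weights = [rng.expovariate(1.0) for _ in range(d)]
        total = sum(weights)
        best = max(best, surprisal_variance([w / total for w in weights]))
    return best


if __name__ == "__main__":
    for d in (2, 3, 7, 20):
        attained = surprisal_variance(two_level_state(d, r_opt(d)))
        print(d, n_of_d(d), attained, max_variance_over_samples(d, 2000))
```

Random sampling only probes typical spectra, so it cannot confirm the bound; it merely shows no violation in the sampled cases and that the two-level spectrum reaches $N(d)$ up to rounding.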